Voice samples preprocessing
After data collection, the voice samples were pre-processed and features were extracted using MFCC. Pre-processing plays an important role in developing an efficient automatic speech/speaker recognition system and helps ML algorithms produce better results. In speech processing, pre-processing includes noise cancellation, pre-emphasis, and silence removal, and it makes voice-based recognition systems computationally efficient [32]. Due to the characteristics of the human vocal system, the glottal airflow and lip radiation attenuate the higher-frequency components of the voiced part of the sound signal. For a voiced sound, the glottal pulse has a slope of approximately −12 dB/octave and the lip radiation a slope of approximately +6 dB/octave; as a result, the sound signal exhibits a downward slope of −6 dB/octave compared with the spectrum of the vocal tract. The process of removing this −6 dB/octave slope is known as pre-emphasis; it removes the effect of the glottal pulse from the actual vocal tract response and balances the dynamics of the power spectrum. Thus, all voice samples were passed through a high-pass filter, which amplifies high-frequency components relative to low-frequency components. Pre-emphasis ensures that all formants of the voice signal have comparable amplitude and therefore equal importance in subsequent processing steps [33]. After pre-emphasis, the voice samples were further processed to remove silence using an amplitude-based silence removal technique [34]. This technique divides the whole audio sample into short fixed-length components called frames and calculates the maximum amplitude of each frame. Frames whose maximum amplitude is greater than 0.03 are considered the voiced portion of the speech, while frames with lower amplitude are discarded.
This technique assumes that the silent part of the voice signal has amplitude \(<0.03\) and the voiced part has amplitude \(>0.03\). The pre-processing is shown in Fig. 4.
As shown in Fig. 4, each voice sample was first pre-emphasized by passing it through a first-order high-pass filter. Each pre-emphasized voice sample was then divided into overlapping frames of 20–30 ms duration for analysis, and each frame was analyzed using a Hamming window. Based on the amplitude of each windowed frame, silent frames were separated from voiced frames and removed.
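The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the pre-emphasis coefficient 0.97 and the 25 ms frame length are assumed typical values (the paper specifies only the 20–30 ms frame range and the 0.03 amplitude threshold), and overlapping/windowed framing is omitted for brevity.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    # First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def remove_silence(signal, sr, frame_ms=25, threshold=0.03):
    # Split the sample into fixed-length frames and keep only frames
    # whose peak amplitude exceeds the threshold (assumed voiced).
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = frames[np.abs(frames).max(axis=1) > threshold]
    return voiced.reshape(-1)
```

For example, a sample consisting of one second of silence followed by one second of tone keeps only the voiced second after this processing.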
Features extraction
After pre-processing, MFCC features were extracted from all voice samples. Feature extraction is the next important step in developing voice-based recognition systems: its output is the main input to the speaker-model development and matching processes.
The MFCC technique is the most popular and most widely used feature extraction technique in speaker and speech recognition systems [35, 36]. It is based on a logarithmic scale and approximates the human auditory response better than other cepstral feature extraction techniques [37, 38]. MFCC features are derived from the short-term Fast Fourier Transform (FFT) power spectrum of the pre-emphasized input speech. To obtain these features, the pre-emphasized voice sample is first divided into fixed-length segments known as frames; the purpose of framing is to analyze the inherently non-stationary speech signal in nearly stationary portions. Each frame is then passed through a window (a Hamming window in our case) to remove discontinuities at its beginning and end. The windowed frame is transformed from the time domain to the frequency domain by applying the FFT, and the transformed values are mapped onto the Mel scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [39]. Finally, the MFCC coefficients are obtained by taking the Discrete Cosine Transform (DCT) of the logarithm of the power in each Mel frequency band.
The process of feature extraction is shown in Fig. 5. In this research, 19 MFCCs were obtained and used as feature vectors for the ML algorithms.
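The windowing–FFT–Mel–log–DCT chain can be sketched as follows. This is a simplified single-frame sketch, not the authors' implementation: the 512-point FFT and the 26-filter Mel bank are assumed common defaults (only the 19 output coefficients come from the paper), and the DCT-II is written out directly rather than taken from a library.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(signal, sr, n_fft=512, n_filters=26, n_ceps=19):
    # Window one frame with a Hamming window and take its power spectrum.
    frame = signal[:n_fft] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)

    # Log filterbank energies, then DCT-II to get the cepstral coefficients.
    log_energy = np.log(fbank @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy
```

In practice the same computation is applied to every overlapping frame, yielding one 19-dimensional vector per frame.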
Classification models
After extraction of the voice features (MFCCs), several popular ML algorithms relevant and suitable to the present research, namely SVM, ANN, k-NN, and RF [40,41,42], were trained on the training set of feature vectors and tested on the test set. Ten-fold cross-validation (TFCV) was used to build the learning models and to produce the train-test splits of the feature vectors. In TFCV, the original set of feature vectors is randomly divided into ten nearly equal sub-samples. One of the ten sub-samples is chosen as the test set, while all the remaining sub-samples are used for training. This process is repeated until each of the ten sub-samples has been used once for testing [43]. Since the classification accuracy is then based on ten estimates rather than a single one, TFCV produces a more precise estimate of classification accuracy than a single hold-out validation [44].
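The TFCV splitting described above can be sketched as follows; the shuffling seed is an illustrative assumption, and the classifier training loop around it is omitted.

```python
import numpy as np

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    # Shuffle the sample indices, then split them into ten nearly equal
    # folds; each fold serves once as the test set while the remaining
    # nine folds together form the training set.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, test
```

Averaging the per-fold accuracies of the ten train/test pairs gives the final TFCV accuracy estimate.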
SVM is a supervised machine learning model [45] that analyzes data and recognizes patterns and is used for regression and classification. It is a well-known discriminative classifier that models the boundary between a speaker and a group of impostors. We used the Sequential Minimal Optimization (SMO) algorithm to train the SVM classifier [46].
k-NN belongs to the family of lazy, instance-based learning algorithms. To classify an unknown sample from the test set, k-NN examines the training set for the k most similar samples. Instance-based classification algorithms provide an efficient implementation of k-NN [47]. We used a Euclidean-distance-based approach for the k-NN implementation [48].
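A Euclidean-distance k-NN classifier of the kind described above can be sketched in a few lines; this is a generic illustration of the algorithm, not the toolkit implementation used in the study, and k=3 is an arbitrary choice for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query vector to every training vector.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Majority vote among the labels of the k nearest neighbours.
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]
```

Being a lazy learner, k-NN has no training phase: all computation happens at query time against the stored training vectors.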
RF belongs to the family of supervised learning models and is used for both classification and regression. It works by constructing a large number of decision trees during the training phase and outputting the class that is the mode of the classes predicted by the individual trees. RF is a precise and robust classifier that is comparatively resistant to overfitting [49]. We implemented RF with bootstrap sampling [40] in this research.
ANN is used in a wide range of applications. It consists of a collection of neurons, often called nodes, connected into a network, and is a simplified model of the human brain [50]. It comprises input, hidden, and output layers, and its objective is to transform inputs into meaningful outputs. In this research, a Multilayer Perceptron (MLP), a feed-forward network trained with the back-propagation algorithm, was implemented to classify instances, with the log-sigmoid function as the neuron activation function [51].
Performance evaluation measurements
The behavior of each classification model is assessed using certain parameters that measure its efficiency. The performance of a model is influenced by the size of the training data, the quality of the voice recordings, and, most significantly, the type of ML model used. We used the following metrics to assess the efficiency of the ML models [52, 53]:
Accuracy: It shows how frequently the classifier predicts the correct values and can be calculated as:
$$Accuracy= \frac{TP+TN}{TP+TN+FP+FN}$$
(1)
Precision: The fraction of relevant examples among the retrieved examples. The precision can be calculated as:
$$Precision= \frac{TP}{TP+FP}$$
(2)
Recall: The fraction of relevant examples retrieved out of all relevant examples; it is defined mathematically as below:
$$Recall = \frac{TP}{TP+FN}$$
(3)
F-measure: It is the harmonic mean of precision and recall. It can be expressed mathematically as below:
$$F{\text{-}}measure = \frac{2*Precision*Recall}{Precision+Recall}$$
(4)
where TP is the number of samples predicted as positive that are actually positive; FP is the number of samples predicted as positive that are actually negative; TN is the number of samples predicted as negative that are actually negative; FN is the number of samples predicted as negative that are actually positive.
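Equations (1)–(4) translate directly from the confusion-matrix counts; the following sketch and its example counts are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    # Direct implementations of Eqs. (1)-(4) from the confusion counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```

For instance, with TP=40, TN=45, FP=5, FN=10 the accuracy is 85/100 = 0.85 and the F-measure is 80/95 ≈ 0.842.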
Root Mean Squared Error (RMSE): It measures the mean magnitude of the errors in an experiment using a quadratic scoring rule and is calculated as below:
$$RMSE= \sqrt{\frac{1}{N}\sum _{i=1}^{N}{(y_i-\hat{y_i})}^2}$$
(5)
where \(\hat{y_i}\) is the estimate of \(y_i\).
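Eq. (5) likewise reduces to one line of NumPy; this is a generic sketch rather than the toolkit's RMSE routine.

```python
import numpy as np

def rmse(y, y_hat):
    # Eq. (5): square root of the mean squared prediction error.
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```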