Development of a regional voice dataset and speaker classification based on machine learning

At present, voice biometrics are commonly used for identification and authentication of users through their voice. Voice based services such as mobile banking, access to personal devices, and logging into social networks are the common examples of authenticating users through voice biometrics. In Pakistan, voice-based services are very common in banking and mobile/cellular sector, however, these services do not use voice features to recognize customers. Therefore, the chance to use these services with false identity is always high. It is essential to design a voice-based recognition system to minimize the risk of false identity. In this paper, we developed regional voice datasets for voice biometrics, by collecting voice data in different local accents of Pakistan. Although, there is a global need for voice biometrics especially when voice-based services are common, however, this paper uses Pakistan as a use case to show how to build regional voice dataset for voice biometrics. To build voice dataset, voice samples were recorded from 180 male and female speakers with two languages English and Urdu in form of five regional accents. Mel Frequency Cepstral Coefficient (MFCC) features were extracted from the collected voice samples to train Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF) and K-nearest neighbor (KNN) classifiers. The results indicate that ANN outperformed SVM, RF and KNN by achieving 88.53% and 86.58% recognition accuracy on both datasets respectively.

Hence biometric approaches may provides superior security and convenience than recognition techniques based on PIN, passwords and identity cards [3,4]. Biometric traits can be used in various applications such as ATM, credit cards, physical access control, cell phone, national ID cards, passport control, driver licenses, dead body identification, and criminal investigation, etc. [5]. Each biometric has its own features and limits. The selection of a biometric particularly depends on the applications for which it is being used. No single biometric is optimal and nor it efficiently fulfills all the requirements of various applications. For example, in some situations, the fingerprint biometric trait is more desirable than the voice biometric trait. In another situation, the voice biometric is preferable than finger print, such as access control for bank transactions via cell phones or landline telephones, voice mails and verification of credit cards, distant access to computers through a modem on the dial-up telephone line in call-centers and forensic applications where speaker recognition is required [4,6]. The human voice carries different characteristics such as the meaning/ words a speaker wants to pass to a listener, spoken language information, emotions, gender, identity, health and speaker's age-related information etc. [6]. The objective of speaker recognition is to extract information about the speaker's identity and based on that information it recognizes the speaker [7]. Speaker recognition is usually subdivided into speaker verification and speaker identification tasks. The speaker verification is the task of verifying a claimed person from his/her voice and verification system must perform a 1:1 comparison hence the cost of computation is independent of the records in the voice database. On the other hand, the speaker identification task is to determine the specific speaker speaking from a speaker's database. In this task the unknown person does not claim identity and there must be 1:N comparisons. In this way, the cost of computation depends on the number of records in the voice database [8].
An important step in designing Voice Based Systems (VBS) i.e. speech and speaker recognition systems is the voice database design. A comprehensive voice database plays significant role for design and development of VBS specifically designed for particular applications, same is the case here (refer to "Database" section for further detail about the designed database). In this research, we developed an Urdu and English languages based voice dataset particularly designed for voice based customer verification services in banking sector in Pakistan. To the best of our knowledge, there is no such voice dataset available, which has been particularly designed for the banking sector in Published literature.
During the database design we particularly focused on minimizing different performance degrading factors/variability mostly encountered in voice databases such as background noise, channel mismatch, speaker's age, health & emotions etc.
Like the other performance degrading factors, different accents and dialects of a language can also cause performance degradation in VBS [9]. Typically, VBS do not perform well if the accent of a speaker, who is going to be recognized, is different from the accent of the speakers by whom the system was trained. Incorporation of accents can minimize the variability caused by different accents of a language, which in turn enhances the performance of the recognition system [10].
Based on the above mentioned detail about VBS and their possible performance degradation factors, here in this research, a voice database (containing Urdu and English datasets) has been designed and tested on several popular ML methods (KNN, RF, SVM and ANN in our case). The specific contributions of the presented research are: • Designing a voice database for the five regional accents of Urdu and English spoken in Pakistan. • Including accents variations in the designed database so that it could be used to design robust Speaker Recognition Systems (SRS) based on this dataset. • Database design to particularly help banking sector applications in Pakistan. • Database design based on a questionnaire/script containing question answers particularly asked by bank representatives for authentication during voice calls in case of lost/theft of credit/debit cards or any other query with the banks in Pakistan. • Testing KNN, RF, SVM and ANN algorithms on the designed dataset.
The other aforementioned performance degrading factors (other than different accents and dialects) have also been taken into consideration during database design, however, to fully address all the performance degrading factors from a single designed system is still a challenging task, which is one of the limitation of the present research. Further detail of the development of the voice database is presented in "Database" section. Related research is presented in "Related research" section. Methodology is presented in "Methodology" section. Results and discussions are provided in "Results and discussion" section. Finally in 'Conclusion" section, conclusion is provided.

Related research
This section provides a review of some of the recent SRS particularly robust against different types of variability (performance degrading factors like accents and dialects of a language) present in them.
Development of voice datasets is an important aspect of designing SRS. Particularly, SRS designed for specific applications need their own voice dataset to develop because such datasets are usually not available, same is the case here in this research.
Some of the variability present in SRS (room reverberation, channel mismatch, seasonal variations) are tried to be addressed during voice database design. However, most of them (accents and dialects, emotions, background noise) need to be addressed during the design and development phase of the SRS [11,12].
National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) [13] is an ongoing series of developing speaker recognition and exploring new promising ideas in this field. The task specifically includes voice database design and development of state of the art SRS robust against different variability/mismatch conditions.
RedDot [14] is a project to collect speech data over mobile devices for speaker recognition. The designed database contains speech data of 45 English speakers (both native and non-native) from 16 countries. The content of the database consists of a short duration test utterances with variable phonetic content. The main focus for the RedDots database was to include a high degree of inter-speaker variations and intra-speaker variations. To achieve the main focus, the speakers were selected worldwide and data was collected from speakers in 91 different sessions.
Although RedDot and NIST speaker recognition evaluations provide great opportunity to the research community to use their datasets and to evaluate their systems but their focus was mostly towards the controlled data collection over mobiles or landline telephones, which restricts the dimensions of different variability present in the data [15].
To fill some of these gaps present in the previous datasets (particularly NIST-SRE), a new voice dataset "Speakers in the Wild (SITW)" was created [16].
SITW [15] is a speaker recognition database specifically collected for the text-independent speaker recognition applications. This database contains audio recordings of 299 speakers that were collected from open source media, with an average of 8 sessions per speaker. The audio recordings in database include unconstrained/wild acoustic conditions like background noise, reverberation as well as large intra-speaker variability. The database also contains audios of speakers in different scenarios like interview, dialog or uncontrolled conditions where multiple speakers are involved. SITW filled much of the gaps present in the previous datasets like NIST but because of manually annotations, its size was quite small.
VoxCeleb2 [17] is a very large-scale audio-visual speaker recognition publicly available dataset collected from open-source media, which contains over a million utterances from over 6000 speakers. Along with its large size, the dataset specially focused to improve the limitations exhibited by the most of the previous datasets like recording under controled conditions [18][19][20][21][22] limited in size because of manually annotations, and not freely available to the research community [18,23].
Accent and dialect classification systems have also provided a great help in solving accent and dialect related variability and designing robust SRS [24].
A weighted accent classification system has been designed by using multiple words [25]. In this research, extreme learning machines (ELMs) and SVMs were used to the problem of accent/dialect classification on TIMIT dataset. The TIMIT comprises utterances from 630 speakers indicating eight different dialect regions of the United States. Design of regional accented databases as well as accent and dialect classification is another important area of research that helps in improving performance of speaker and speech recognition systems designed for specific regions.
An Algerian Speech corpus [26] was designed to support research in speech recognition. The corpus represents 300 Algerian native speakers who could speak Modern Standard Arabic (MSA) language. The speakers were selected from 11 different regions of Algerian, with both genders (148 males and 152 females), with different age groups and with different educational levels (primary school to postgraduate level). Finally, using a subset of the collected dataset, the author has designed a text-independent ASR system using Hidden Markove Model (HMM) that achieved a 91.65% recognition rate.
Shah et al. [10] designed a voice database considering the same strategy i.e. the voice samples were recorded using different recording devices to minimize the channel variability, quite rooms were used to record the voice samples to minimize the effect of the room reverberation, data was recorded with different sessions spread over ample amount of time to cope with the seasonal variations' effect on the collected dataset as well as a particular script containing the words spoken differently in different accents and dialects to make the designed dataset robust against accentual and dialectical variations present in Pashtu language. On the other hand, Gaussian noise was added in the extracted feature vectors to make the designed SRS robust against background noise. An accent classification system for classifying the regional accent of Philippine was presented by Dano et al. [27]. Voice data was collected from 150 native residents of Philippine. MFCC features were extracted to train MLP and k-NN classifier. In this research, MFCC-MLP based accent classification system outperforms k-NN by achieving recognition accuracy of 56.19%.
A small scale database of 11 speakers including 7 males and 4 females, with their age ranging from 19-36 years was collected by Thullier et al. [28]. All of the speakers were native French. Some speakers from all had unique Canadian French accents and Hexagonal French accents. The voice samples were collected in a silent room (university meeting room) as well as in a noisy environment (i.e. University cafeteria).
Based on the above discussion and the importance of designing regional voice databases, here in this research we designed a voice database for five regional accents (spoken in Gilgit Biltistan (GB), Pakistan) of Urdu and English, which are the important regional accents of Pakistan. The next section describes the database development phase of our research.

Database
This paper presents development of voice datasets used to evaluate the performance of various ML algorithm. The datasets contain 7200 voice samples of 180 different speakers of GB region of Pakistan. The proposed datasets add diversity to existing datasets in terms of different local accents. Datasets development consists of various steps, each step is explained below.

A regional voice dataset collection with accent variation
To design Urdu and English voice datasets with various accentual and dialectical variations, the Gilgit-Baltistan (GB) Provence of Pakistan has been selected. GB is an important region of the China-Pakistan Economic Corridor (CPEC) and has great diversity in voice accents. It borders with Azad Kashmir, Jammu Kashmir, Khyber Pakhtunkhwa, Afghanistan, and China. GB is an area of high mountains and has an area of over 72,496 km 2 . The capital city of GB is Gilgit and the population of GB is about 2.0 millions. There are ten districts in GB i.e. Gilgit, Nagar, Hunza, Ghizer, Astore, Skardu, Diamer, Ghanche, Shigar, and Kharmang. The people of this region have different native languages and have different cultures and backgrounds. There are five different native languages spoken in these districts, which are Shina, Balti, Burushishki, Khuwar and Wakhi [29,30]. Speakers of each of these districts speak their standard native Languages (which are used as a mean of official communication in offices as well as are communicated on radio broadcasts) with different accents. Based on these five regional accents, several speakers were chosen from each district as indicated in Table 1.

Age distribution of the speakers
For the design of voice database, the voice data was collected from male and female speakers with different ages ranging from 18-69 years. The purpose of selecting speakers with different age groups is to include acoustic variation, which arises in the voice of speakers at different stages of age to cover maximum telephone banking customers and mobile users. The Age-wise distribution of the speakers is shown in the Fig. 1

Design of script for speakers
The voice data was collected from each selected speaker based on two specifically designed scripts in Urdu and English languages. The scripts contain all possible conversational talk between a phone banking officer/mobile call center agent and their customers. These scripts contain sentences in the form of words, 10-16 digit strings, and the speaker's personal information mostly related to bank and mobile network services. All together 20 sentences with average time duration ranging between 10 and 100 ms were included in each of the scripts. The scripts were provided to each speaker to read for recording data. The designed written script for the English language is shown in Table 2, whereas, the same script was translated into Urdu for the recording of voice samples in Urdu language.

Voice data recording environment and device allocation
The data was recorded from speakers in the university office, seminar room, and rest room using different smartphones and landline. The specification of the smart phones which have been used for data recording is as follows.

Landline
The voice data was recorded from a total of 180 speakers. The speakers were sub divided into 6 groups where each group contained 30 speakers. All the groups were assigned a different specific device for recording.

Recording sessions
The voice data was collected in seven recording sessions with a gap of at least one month. Figure 2 depicts recording sessions and their corresponding speakers. The  [31].

Recording of voice samples
The designed scripts (Urdu and English) were provided to each speaker who was selected in a particular session for recording. Before the start of each session, each speaker was communicated on how to record their voices, and afterword, they were supposed to rehearsal for a short period. Finally, the data was collected sentence by sentence according to the script. After the recording of each sentence, the recorded sample was verified by just replaying the recorded sentence to ensure the acquisition of appropriate sample. Since the scripts contained 20 different sentences each, therefore, each speaker recorded 20 separate sentences for the Urdu language as well as for English. A total of 3600 (20 * 180) voice samples have been recorded for the English Language. Similarly, a total of 3600 (20 * 180) voice samples have been recorded for the Urdu language. So overall a total of 7200 (3600 + 3600) voice samples have been recorded. All the recorded samples are then transferred to a laptop and converted to .wav from the default format of the allocated devices using audio converter (4dots software) for further processing. The voice samples were recorded in a systematic way as shown in Fig. 3. As per Fig. 3, the designed scripts were distributed to the speakers selected for a particular session for recording. During a practice session, the speakers were given instructions on how to read the script and making them familiar with the acquisition process. It was a kind of practice session before the actual recording. Afterwards, the speaker's voice samples were recorded sample by sample. Each sample was cross-checked with the script to ensure the consistency of the acquired voice sample with the script. All consistent samples were kept as voice database and inconsistent samples were discarded and the process was continuing until the collection of all voice samples.

Voice samples preprocessing
After data collection, the collected voice samples were pre-processed and features were extracted using MFCC. Pre-processing of speech plays an important role in the development of an efficient automatic speech/speaker recognition systems. Pre-processing is an important step for ML algorithms to produce better results. In speech processing, the pre-processing includes noise cancellation, pre-emphasis and silence removal. Pre-processing facilitates the voice-based recognition systems to be computationally efficient [32]. Due to the characteristics of the human vocal system, glottal airflow and lip radiations depress higher frequency components of the voiced part of the sound signal. For the voiced sound signal, the glottal pulse has a slope of approximately − 12dB/octave, and the lip radiation has a slop of approximately + 6dB/ octave. Resultantly sound signal introduces a slope of − 6 dB/octave down-ward if compared with the spectrum of vocal tract. The process to remove the slope of − 6 dB /octave is known as pre-emphasis and it removes the effects of the glottal pulse from actual vocal tract and balance power spectrum dynamics. Thus, all voice samples have been passed through a high pass filter. It amplifies high-frequency components with respect to low-frequency components. The pre-emphasis method ensures that all formants of the voice signal have identical amplitude so that have equal importance in subsequent processing steps [33]. After pre-emphasis, the voice samples were further processed to remove silence using amplitude based silence removal technique [34] This technique divides the whole audio sample into components of short fixed length called frames and calculates maximum amplitude of each frame. It then finds those frames with the maximum amplitude is greater than 0.03 and considers those frames as voice portion of the speech and discards the frames with lesser amplitude then 0.03. This technique assumes that the silent part of the voice signal has amplitude < 0.03 and the voice part of voice signal contains amplitude > 0.03 . The pre-processing is shown in Fig. 4.

Fig. 3 Voice samples recording process
As per Fig. 4, for pre-processing of voice samples, initially, each voice sample was preemphasized by passing through a first order high pass filter. Each pre-emphasized voice sample is then divided into 20 ms to 30 ms duration overlapping frames for analysis. Each frame is then analyzed using Hamming window. Based on the amplitude of each windowed frame, frames with silence and without silence were separated and frames with silence were removed.

Features extraction
After pre-processing, MFCC features were extracted from all voice samples. Feature extraction is the next important step after pre-processing for developing voice based recognition systems. The output from the feature extraction process is the main input for speaker model development and matching processes.
The MFCC technique is a most popular, has a huge achievement and extensively used in the speaker and speech recognition systems [35,36]. It is based on a logarithmic scale and is able to estimates human auditory response in a better way than the other cepstral feature extraction techniques [37,38]. MFCC features are derived from short-term Fast Fourier Transform (FFT) power spectrum of the pre-emphasized input speech samples. To obtain these features, initially, the pre-emphasized input voice sample is divided into fixed length segments known as frames. The purpose of framing is to analyze the input voice sample in nearly non-varying/static form (by nature, speech signals are non-stationary). After framing each frame is passed Fig. 4 Voice samples' pre-processing through a window (hamming window in our case) to analyze. Each frame is analysed by applying window to remove discontinues at the beginning and the end of each frame. The resulting speech sample is then transformed into frequency domain from time domain by simply applying FFT. Transformed signal values are then plotted against the Mel scale (Mel-scale has linear frequency spacing lower than 1000 Hz and a logarithmic frequency interval higher than 1000 Hz) [39]. Finally, MFCC coefficients are obtained by using a Discrete Cosine Transform (DCT) of the logarithm of the power on each Mel frequency.
The process of feature extraction is shown in Fig. 5. In this research, 19 MFCC were obtained and used as feature vectors for ML algorithms.

Classification models
After extraction of voice features (MFCCs), some of the popular ML algorithms relevant and suitable to the present research such as SVM, ANN, k-NN, and RF [40][41][42] were trained using training set of feature vectors and tested on the set of test feature vectors. To build learning models and for train-test splits of feature vectors, ten-fold crossvalidation (TFCV) technique was used. In TFCV method, the original feature vector is randomly divided into ten nearly equal sub samples. One of the ten samples is randomly chosen as a test feature vector, whereas, all the remaining sub-samples are used as training feature vectors. Similar, process is repeated until all the ten sub-samples of feature vectors have been tested [43]. In this procedure, since, the classification accuracy is based on ten estimates rather than just a single estimate, therefore, TFCV produces a more precise estimate of classification accuracy than cross validation [44].
SVM is a supervised machine learning model [45] that analyzes data and recognizes patterns, used for regression and classification. It is a well-known discriminative classifier that models the boundary between the speaker and a group of impostors. We implemented a Sequential Minimal Optimization (SMO) algorithm to train SVM classifier [46].
K-NN algorithm is a family of lazy and instance-based learning algorithms. Whenever it is necessary to classify a sample of unknown data from a test data set, the KNN's task is to examine training data set for the most related k samples. Instance based classification algorithm, provide an efficient implementation of the KNNs [47]. We used euclidean distance based approach for KNN implementation [48].
RF belongs to a family of supervised learning model, used for the task of classification and regression. RF works by constructing a huge amount of decision trees during Fig. 5 Features extraction process the training phase and producing a class that is the mode of the classes. RF is a very precise and robust classifier and is not subject to overfitting problem [49]. We implemented RF with bootstrapping of samples technique [40] in this research.
ANN is used in a wide range of applications. It consists of a collection of various neurons often called nodes of network connection and is a simplified version of the human brain [50]. It consists of an input, hidden, and output layers. Its objective is to get inputs and transform them into meaningful outputs. Here in this research, Multilayer Perceptron (MLP) algorithm of ANN with back propagation feed-forward algorithm is implimentd to classify instances and log-sigmoid function as a neuron activation function [51].

Performance evaluation measurements
The behavior of each classification model is assessed on the basis of certain parameters for measuring its efficiency. The performance of model is influenced by the training data size, the quality of voice records, and most significantly the type of ML model used. We have used the following measurement matrices to assess the efficiency of the ML models [52,53]: Accuracy: It shows how frequently the classifier predicts the correct values and can be calculated as: where TP is the number of samples predicted as positive that are actually positive; FP is the number of samples predicted as positive that are actually negative; TN is the number of samples predicted as negative that are actually negative; FN is the number of samples predicted as negative that are actuall positive Root Mean Squared Error: It is used to measure the mean magnitude of the errors in an experiment using a quadratic scoring rule and it can be calculated as bellow:: where ŷ i is the estimate of y i .

Results and discussion
To test the effectiveness of ML models on the collected voice datasets, we performed two experiments. In the first experiment, the performance of ANN, SVM, KNN and RF models was evaluated on the feature vectors obtained from the English voice dataset, whereas, in the second experiment, the performance of the same models was evaluated on the feature vectors obtained from the Urdu voice dataset. In each experiment, for the training and testing splits of features data, TFCV method was used. WEKA was used as implementation tool to implement ML classifiers.     Table 4 shows that ANN classifier outperformed SVM, RF and KNN by showing 2.99%, 3.25% and 2.42% improvement in classification accuracy respectively in classifying speakers of English dataset. Root Mean Square Error (RMSE) is the standard deviation of the recognition errors. SVM has the highest RMSE at 0.074 followed by RF at 0.049. ANN and KNN have the lowest RMSE of 0.032 and 0.038. Table 5 shows that again ANN outperformed RF, KNN and SVM by 5.4%, 3.5% and 4.8% respectively in classifying speakers on Urdu dataset. Furthermore, SVM has the uppermost RMSE of 0.074 followed by RF of 0.051, whereas ANN and KNN have the lower RMSE of 0.034 and 0.042 respectively. This also indicates that SVM, RF and KNN misclassified most of the Urdu voice dataset compared to ANN. Figures 6 and 7 provide the comparison of the performance of ML models on English and Urdu datasets. Figure 6 provides the comparision of recognition accuracies of ML models (i.e. ANN, SVM, KNN and RF) on English and Urdu datasets. All the models achieved higher accuracy on English dataset as compared to Urdu dataset. However, ANN achieved high accuracy for both the datasets as compared to other classifiers. Fig. 6 The accuracy of the classification models with TFCV for English and Urdu Dataset Fig. 7 The RMSE of the classification models with TFCV for English and Urdu Dataset Figure 7 provides the comparision of RMSE of the models on English and Urdu datasets. All the models provided Lower RMSE on English dataset as compared to Urdu dataset. However, ANN model showed lower RMSE for both the datasets.
It can be seen through the results provided that ANN outperformed all the other tested ML models on the developed datasets. As already mentioned, Multilayer Perceptron (MLP) algorithm of ANN was used to classify different speakers based on different accents. MLP better approximates the classes overlapping in nature [54]. The same is case here in this research, where the different regional accents of GB, Pakistan are overlapping in nature, that is one of the reason why ANN outperformed the other used ML algorithms.

Comparison with the other systems in literature
Finally, the achieved experimental results were compared with some of the state of the art recently proposed classifier models. During the literature review, it was found that: • Rizwan et al. [25] applied SVM on TIMIT dataset and achieved 77.8% recognition accuracy. • Danao et al. [27] applied MLP on their own designed Philippine dataset and achieved 56.19% recognition accuracy. • Shah et al. [10] applied MLP on their own developed Poshtu speakers' dataset and achieved 87.5% recognition accuracy. • Liu et al. [55] developed an MFCC-based text-independent speaker identification system for access control. In their system, along with MFCC features, they used Gaussian Mixture Models (GMM) as a classifier. Their system achieved overall 86.87% identification accuracy.
Comparative studies indicates that the proposed ANN model using collected English dataset outperformed all the above mentioned models reported in literature.

Conclusion
In this paper, the authors have designed voice datasets in Urdu and English languages with five different regional accents spoken in GB, located at the north of Pakistan. These datasets are specifically designed to support and extend research in the domain of speaker recognition systems. The designed voice datasets represents 7200 voice samples of 180 speakers and the content is in the form of single words, 10-16 digit strings and speaker's personal information. Designed datasets were pre-processed to extract voice features and used in training four ML algorithms including SVM, ANN, RF and KNN. The recognition accuracy indicates that ANN classifier model outperforms SVM, RF and KNN by achieving 88.53% and 86.58% on English and Urdu dataset respectively. Similarly, RMSE of ANN is best among other ML algorithms as the ANN model is simple and does not overfit. Moreover, it was found that ANN outperformed some of the recently proposed classifiers in literature.