
Table 12 Automatic speech recognition

From: An analytical study of information extraction from unstructured and multidimensional big data

Columns: Purpose, Approach, Technique, Dataset, Results/limitations
[68]
Purpose: To improve computational power, enhance the training capability of larger models, and ease the training process.
Approach: ANN
Technique: Mariana: GPU and CPU clusters for parallelism. Three frameworks were developed: multi-GPU for DNN, multi-GPU for DCNN, and a CPU cluster for large-scale DNN.
Results/limitations: With 6 GPUs, a 4.6x speedup over one GPU was achieved, and the character error rate decreased by 10% compared with existing techniques. The DNN framework with GPUs performed better for ASR.
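The multi-GPU and CPU-cluster frameworks in [68] rest on data parallelism: each worker computes gradients on its own shard of the mini-batch, the gradients are averaged, and the shared weights are updated once per step. A minimal sketch, assuming a toy 1-D linear model and a simple all-reduce-style average (the worker shards, loss, and learning rate are illustrative, not details of the Mariana system):

```python
# Data-parallel SGD sketch: each "worker" (here, a loop over shards)
# computes a gradient on its own data shard; the gradients are averaged,
# then the single shared parameter is updated once per step.

def grad_on_shard(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w*x."""
    g = 0.0
    for x, y in shard:
        g += 2.0 * (w * x - y) * x
    return g / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    # Average per-worker gradients, mimicking a parameter-server reduce.
    avg_g = sum(grad_on_shard(w, s) for s in shards) / len(shards)
    return w - lr * avg_g

# Toy data generated from y = 3*x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

The synchronous averaging shown here is the simplest variant; large-scale systems overlap communication with computation to keep the speedup close to the worker count.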
[72]
Purpose: To investigate noise robustness of DNN-based models.
Approach: ANN
Technique: DNN-HMM: DNN-based noise-aware training.
Dataset: Aurora 4
Results/limitations: 7.5% relative improvement without explicit noise compensation. Dropout training in the DNN has overlapping concerns with feature-space and model-space noise-adaptive training.
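Noise-aware training as in [72] conditions the DNN on an estimate of the noise: each acoustic frame is augmented with a per-utterance noise estimate, commonly the element-wise average of the first few frames, which are assumed to contain no speech. A minimal sketch; the frame vectors and the three-frame noise window are illustrative assumptions, not values from the paper:

```python
def noise_aware_augment(frames, n_noise_frames=3):
    """Append a per-utterance noise estimate to every frame.

    frames: list of equal-length feature vectors (lists of floats).
    The noise estimate is the element-wise mean of the first
    n_noise_frames frames, assumed to be speech-free.
    """
    dim = len(frames[0])
    head = frames[:n_noise_frames]
    noise = [sum(f[i] for f in head) / len(head) for i in range(dim)]
    # Each augmented frame is the original vector with the noise estimate
    # appended, doubling the input dimensionality seen by the DNN.
    return [f + noise for f in frames]

# Toy 2-dimensional "log-mel" frames for one utterance.
utt = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0], [10.0, 4.0]]
aug = noise_aware_augment(utt)
```

The network can then learn to compensate for the noise itself, which is why no explicit noise compensation stage is needed at test time.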
[73]
Purpose: Bilingual ASR system for the Frisian and Dutch languages.
Approach: ANN
Technique: DNN with language-dependent and language-independent phones.
Dataset: FAME speech database
Results/limitations: The bilingual DNN trained on phones of both languages achieved the best performance, yielding a CS-WER of 59.5% and a WER of 38.8%. Code-switching ASR that combines the phones of two languages outperformed on WER, while switching latency is also an important factor for these systems.
[75]
Purpose: To improve performance for large-vocabulary speech recognition.
Approach: ANN
Technique: LSTM RNN
Dataset: 2800 utterances, each distorted once with held-out noise samples.
Results/limitations: On a 25k-word vocabulary, 19.5% WER and 14.5% in-vocabulary WER. Word-level acoustic models without a language model can achieve reasonable accuracy.
[74]
Purpose: To compare the performance of DNN-HMM with CNN-HMM.
Approach: ANN
Technique: CNN with a limited weight-sharing scheme to model speech features.
Dataset: Small-scale phone recognition on TIMIT; a large-vocabulary voice search task.
Results/limitations: The CNN reduced the error rate by 6% to 10% compared with the DNN. ASR performance is sensitive to pooling size but insensitive to overlap between pooling units. The results were better for the voice search experiment but not for phone recognition.
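The pooling-size sensitivity reported for [74] concerns max-pooling along the frequency axis of the convolutional feature maps: the window size controls how much spectral shift the features tolerate, while the stride relative to the size controls window overlap. A minimal 1-D sketch with configurable size and stride (overlap occurs when stride < size); the feature-map values are illustrative:

```python
def max_pool_1d(xs, size, stride):
    """Max-pool a 1-D feature map; stride < size gives overlapping windows."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, stride)]

# Toy frequency-axis activations from one convolutional feature map.
fmap = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]

pooled = max_pool_1d(fmap, size=2, stride=2)          # non-overlapping
pooled_overlap = max_pool_1d(fmap, size=3, stride=2)  # overlapping windows
```

Changing `size` alters which activations survive pooling, whereas adding overlap mostly duplicates information, matching the reported insensitivity to overlap.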
[71]
Purpose: To develop ASR for the Amazigh language.
Approach: HMM
Technique: GMM and tied states; MFCC for feature extraction; a phonetic dictionary; a language model built with the CMU-Cambridge Statistical Language Modeling Toolkit; an HMM-based large-vocabulary system.
Dataset: A new corpus with 187 distinct isolated-word speech recordings by 50 speakers.
Results/limitations: Achieved a reduced WER of 8.20%. A new corpus was collected, but the results are not compared with existing state-of-the-art techniques.
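The MFCC features used in [71] map spectral energies onto the mel scale, which is roughly linear below 1 kHz and logarithmic above it, before log compression and a DCT. A sketch of the standard HTK-style hertz/mel conversion used to place triangular filterbank centres; the frequency range and filter count below are common defaults, not values from the paper:

```python
import math

def hz_to_mel(f):
    """Standard HTK-style hertz-to-mel mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place filterbank centres back in hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank_centres(f_low, f_high, n_filters):
    """Centre frequencies of n_filters triangular filters, evenly
    spaced on the mel scale between f_low and f_high."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_filters)]

# 26 filters spanning 0-8 kHz, typical for 16 kHz speech.
centres = filterbank_centres(0.0, 8000.0, 26)
```

Even spacing on the mel axis means the filters are densely packed at low frequencies and sparse at high ones, mirroring the ear's frequency resolution.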
[69]
Purpose: An LMS adaptive filter is introduced to preprocess the speech signals and to identify the speaker.
Approach: Template based
Technique: Adaptive filtering + feature extraction + dimensionality reduction + an ensemble classification model using LSTM, ICNN, and SVM.
Dataset: IITG multi-variability speaker recognition database
Results/limitations: Achieved 95.69% accuracy on noisy data. Follows sequential processing; requires memory-bandwidth-bound computation; requires a large amount of training data for each new speaker.
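The LMS adaptive filter used for preprocessing in [69] updates its tap weights after every sample, stepping along the negative gradient of the instantaneous squared error. A minimal sketch; the tap count, step size, and the toy system-identification setup are illustrative assumptions, not details from the paper:

```python
import math

def lms_filter(x, d, n_taps=4, mu=0.05):
    """Least-mean-squares adaptive filter.

    x: input samples (e.g. noisy speech), d: desired samples.
    Returns (outputs, errors, final_weights).
    """
    w = [0.0] * n_taps
    ys, es = [], []
    for n in range(len(x)):
        # Most recent n_taps input samples, zero-padded at the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
        y = sum(wi * ui for wi, ui in zip(w, u))        # filter output
        e = d[n] - y                                    # estimation error
        w = [wi + mu * e * ui for wi, ui in zip(w, u)]  # LMS weight update
        ys.append(y)
        es.append(e)
    return ys, es, w

# Toy use: identify a known 2-tap system d[n] = 0.5*x[n] + 0.25*x[n-1].
x = [math.sin(0.3 * n) for n in range(500)]
d = [0.5 * x[n] + (0.25 * x[n - 1] if n > 0 else 0.0) for n in range(500)]
_, errs, w = lms_filter(x, d)
```

The per-sample update is what makes the processing inherently sequential, consistent with the limitation noted above.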
[70]
Purpose: ASR for the Tunisian dialect.
Approach: Rule based
Technique: G2P rules were defined to build pronunciation dictionaries.
Dataset: TARIC; 9.5 h of speech for training and 43 min for testing.
Results/limitations: WER of 22.6%, validated on a manually annotated dataset. Higher-quality pronunciation dictionaries can be built using expert knowledge, but strong linguistic skills are required.
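Rule-based G2P as in [70] maps a spelling to a phone sequence by scanning the word and applying the longest matching grapheme rule at each position. The rule table below is a tiny, hypothetical English-like set for illustration only, not the expert-written Tunisian-dialect rules of the paper:

```python
# Hypothetical grapheme-to-phoneme rules; a real system encodes
# context-sensitive rules written by linguists.
RULES = {"sh": ["SH"], "ch": ["CH"], "a": ["AH"], "e": ["EH"],
         "i": ["IH"], "o": ["OW"], "u": ["UH"], "s": ["S"],
         "h": ["HH"], "t": ["T"], "n": ["N"], "p": ["P"]}

def g2p(word, rules=RULES, max_len=2):
    """Greedy longest-match grapheme-to-phoneme conversion."""
    phones, i = [], 0
    while i < len(word):
        for size in range(max_len, 0, -1):   # prefer longer graphemes
            chunk = word[i:i + size]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += size
                break
        else:
            i += 1  # skip characters with no matching rule
    return phones
```

Running `g2p` over a word list yields one pronunciation-dictionary entry per word; the quality of the dictionary depends entirely on the linguistic quality of the rule set, which is the limitation noted above.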