Table 12 Automatic speech recognition

From: An analytical study of information extraction from unstructured and multidimensional big data

 

| Ref | Purpose | Approach | Technique | Dataset | Results/limitations |
|---|---|---|---|---|---|
| [68] | To improve computational power, enhance the training capability of larger models, and ease the training process | ANN | Mariana: GPU and CPU clusters for parallelism; three frameworks were developed: multi-GPU for DNN, multi-GPU for DCNN, and a CPU cluster for large-scale DNN | — | With 6 GPUs, a 4.6× speedup over one GPU was achieved, and the character error rate decreased by 10% compared with existing techniques. The DNN framework with GPUs performed better for ASR |
| [72] | To investigate noise robustness of DNN-based models | ANN | DNN-HMM: DNN-based noise-aware training | Aurora 4, without explicit noise compensation | 7.5% relative improvement; dropout training in DNN shows overlapping gains with feature-space and model-space noise-adaptive training |
| [73] | Bilingual ASR system for the Frisian and Dutch languages | ANN | DNN with language-dependent and language-independent phones | FAME speech database | A bilingual DNN trained on phones of both languages achieved the best performance, yielding a CS-WER of 59.5% and a WER of 38.8%. Code-switching ASR combining phones of two languages outperformed on WER, though switching latency is also an important factor for these systems |
| [75] | To improve performance for large-vocabulary speech recognition | ANN | LSTM RNN | 2800 utterances, each distorted once with held-out noise samples | On a 25k-word vocabulary, 19.5% WER and 14.5% in-vocabulary WER; word-level acoustic models without a language model can achieve reasonable accuracy |
| [74] | To compare the performance of DNN-HMM with CNN-HMM | ANN | CNN with a limited weight-sharing scheme to model speech features | Small-scale phone recognition on TIMIT and a large-vocabulary voice-search task | CNN reduced the error rate by 6% to 10% compared with DNN; ASR performance is sensitive to pooling size but insensitive to overlap between pooling units. Results were better for the voice-search experiment but not for phone recognition |
| [71] | To develop ASR for the Amazigh language | HMM | GMM and tied states; MFCC for feature extraction; phonetic dictionary; language model built with the CMU-Cambridge Statistical Language Modeling Toolkit; HMM-based large-vocabulary system | New corpus with 187 distinct isolated-word speech recordings by 50 speakers | Reduced WER to 8.20%. A new corpus was collected, but results are not compared with existing state-of-the-art techniques |
| [69] | LMS adaptive filters are introduced to preprocess the speech signals and to identify the speaker | Template-based | Adaptive filtering + feature extraction + dimensionality reduction + ensemble classification model using LSTM, ICNN, and SVM | IITG multi-variability speaker recognition database | Achieved 95.69% accuracy on noisy data. Limitations: follows sequential processing, requires memory-bandwidth-bound computation, and requires a large amount of training data for each new speaker |
| [70] | ASR for the Tunisian dialect | Rule-based | G2P rules were defined to build pronunciation dictionaries | TARIC: 9.5 h of speech for training and 43 min for testing | WER of 22.6%, validated on a manually annotated dataset. Higher-quality pronunciation dictionaries can be built using expert knowledge, but strong linguistic skills are required |
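
Several rows above report results as word error rate (WER). As a reference point, WER is the word-level Levenshtein distance between hypothesis and reference (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch in Python (the function name `wer` is illustrative, not taken from the cited works):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                   # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                   # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why some of the code-switching results above look unusually high.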
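
Row [69] preprocesses speech with LMS adaptive filtering before feature extraction. A minimal single-channel LMS sketch; the tap count, step size, and the toy system-identification setup below are illustrative assumptions, not details from [69]:

```python
import random

def lms_filter(x, d, n_taps=4, mu=0.05):
    """Least-mean-squares (LMS) adaptive FIR filter.

    x  -- reference input samples
    d  -- desired (target) samples
    Returns (filter outputs, error signal, final tap weights).
    """
    w = [0.0] * n_taps                 # adaptive tap weights
    buf = [0.0] * n_taps               # recent inputs, newest first
    outputs, errors = [], []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))        # filter output
        e = dn - y                                        # instantaneous error
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]  # LMS weight update
        outputs.append(y)
        errors.append(e)
    return outputs, errors, w

# Toy system identification: the filter should learn the FIR taps [0.5, -0.3].
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [0.5 * x[n] - 0.3 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
_, errors, w = lms_filter(x, d, n_taps=2, mu=0.1)
```

In a speech-denoising setup, `d` would be the noisy speech and `x` a correlated noise reference; the error signal then approximates the cleaned speech. The step size `mu` trades convergence speed against steady-state misadjustment.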