From: An analytical study of information extraction from unstructured and multidimensional big data
| Ref | Technique | Purpose | Domain | Dataset | Results |
|---|---|---|---|---|---|
| [9] | Self-training with CNN-LSTM-CRF | Improve the performance and accuracy of NER for large-scale unlabeled clinical documents | Medical (clinical text) | Records of 19,378 patients | P: 84.2%, R: 85.5%, F: 84.4% |
| [10] | CNN + Bi-LSTM + CRF | Improve NE extraction on large online medical diagnosis (OMD) data with complex structure, without manual rules or features, i.e. to deal with volume and variety | Online medical diagnosis text (EMR) | Untagged corpus of 320,000 Q&A records from an online Q&A website | Trained on 1/3, 2/3, and all of the data, yielding F-measures of 87.26%, 88.79%, and 90.31%, respectively |
| [11] | Comparison of three sequence-labeling techniques (CRF, MEMM, SVMhmm) on a BioNLP task, using one classifier (SVMmulticlass) | Evaluate the performance of ML methods and identify the best features for automatic extraction of habitat entities; features used: orthographic, morphological, syntactic, semantic | Bacterial biotope entities | Two BioNLP datasets: BB2013, BB2016 | CRF and SVMhmm perform comparably, but CRF achieves higher precision whereas SVMhmm has better recall; CRF and MEMM are shown to be more robust than SVMhmm under poor feature conditions |
| [12] | SML-based pi-CASTLE: crowd-assisted IE system | Store text annotations in a database; addresses the challenges of a probabilistic data model, selection of uncertain entities, and integration of human annotations | — | For NER: CoNLL 2003 corpus; TwitterNLP dataset with 2,400 unstructured tweets | pi-CASTLE achieves an optimal balance between cost, speed, and accuracy for IE problems |
| [13] | Hybrid method that automatically generates rules | Automatically extract and structure patient-related entities from large-scale data | Diagnosis extraction | EHR clinical notes of 9.5M patient records | Five use cases applied to demonstrate modularity, extensibility, scalability, and flexibility |
| [14] | Unsupervised ML (clustering) | Examine the impact of volume on three unsupervised ML methods (spectral, agglomerative, and K-Means clustering) | Facebook posts | 314,773 posts by companies and 1,427,178 posts by users of these companies | P: 40.7%, R: 83.5%, F: 56.3%; spectral clustering performed better on larger datasets |
| [15] | Grammar rules + MapReduce | Handle a large amount of data through parallelization; suitable for incomplete datasets | Free text | Three text datasets with 1,293, 689, and 1,654 sentences, respectively | Better recall on all three text datasets, but low precision |
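The CRF-based sequence labelers in rows [9], [10], and [11] all share the same inference step: Viterbi decoding over label transition and emission scores. The following is a minimal pure-Python sketch of that step; the states, tokens, and all weights are hand-set illustrative values (not learned CRF parameters from any of the surveyed systems).

```python
# Minimal sketch of Viterbi decoding, the inference step shared by the
# CRF-based sequence labelers in rows [9]-[11]. All scores are
# illustrative hand-set log-weights, not learned parameters.

def viterbi(obs, states, start, trans, emit):
    """Return the highest-scoring label sequence for the tokens in obs."""
    # best[i][s]: score of the best path ending in state s at position i
    best = [{s: start[s] + emit[s][obs[0]] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (best[i - 1][p] + trans[p][s] + emit[s][obs[i]], p)
                for p in states
            )
            best[i][s] = score
            back[i][s] = prev
    # Backtrack from the best final state to recover the label sequence.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy BIO tagging example with hypothetical clinical-style tokens.
states = ["O", "B", "I"]
start = {"O": -1.0, "B": -1.0, "I": -9.0}
trans = {
    "O": {"O": -1.0, "B": -1.0, "I": -9.0},
    "B": {"O": -1.0, "B": -2.0, "I": -1.0},
    "I": {"O": -1.0, "B": -2.0, "I": -1.0},
}
emit = {
    "O": {"fever": -3.0, "and": -0.5, "cough": -3.0},
    "B": {"fever": -0.5, "and": -3.0, "cough": -0.5},
    "I": {"fever": -2.0, "and": -3.0, "cough": -2.0},
}
tags = viterbi(["fever", "and", "cough"], states, start, trans, emit)
# With these weights, "fever" and "cough" decode as entity starts: ["B", "O", "B"]
```

In a full CRF the transition and emission scores come from trained feature weights; the decoding itself is unchanged.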
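Row [15] scales rule-based extraction by casting it as a MapReduce job. A minimal sketch of that pattern is below; the "grammar rule" is a trivial stand-in (capitalized tokens as candidate entities) rather than the rules from the paper, and the map and reduce phases run sequentially here, whereas in a real deployment each map call would run on a separate worker.

```python
# Sketch of the MapReduce pattern from row [15] for parallelizing
# rule-based extraction over free text. The extraction "rule" is a
# hypothetical stand-in: treat capitalized tokens as candidate entities.
from collections import Counter
from functools import reduce

def map_phase(sentence):
    """Map: apply the extraction rule to one sentence, emitting entity counts."""
    return Counter(tok for tok in sentence.split() if tok.istitle())

def reduce_phase(acc, partial):
    """Reduce: merge partial counts produced by the map phase."""
    return acc + partial

sentences = [
    "Alice visited Paris last spring",
    "Bob met Alice in Paris",
    "the museum was closed",
]
# Each map_phase call is independent, so the map step parallelizes trivially.
entities = reduce(reduce_phase, map(map_phase, sentences), Counter())
# entities counts "Alice" and "Paris" twice each, "Bob" once
```

Because the reduce step only merges counters, it tolerates missing or partial sentences, which matches the paper's claim of suitability for incomplete datasets.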