Table 4 Named entity recognition

From: An analytical study of information extraction from unstructured and multidimensional big data

[9] Technique: Self-training with CNN-LSTM-CRF
Purpose: To improve the performance and accuracy of NER on large-scale unlabeled clinical documents
Domain: Medical (clinical text)
Dataset: 19,378 patient records
Results: P 84.2%, R 85.5%, F 84.4%

[10] Technique: CNN + Bi-LSTM + CRF
Purpose: To improve NE extraction on large-volume, structurally complex online medical diagnosis data without manual rules or features, i.e., to deal with volume and variety
Domain: Online medical diagnosis text (EMR)
Dataset: Untagged corpus of 320,000 Q&A records from an online Q&A website
Results: Trained on 1/3, 2/3, and all of the data to compare performance, yielding F-measures of 87.26%, 88.79%, and 90.31%, respectively

[11] Technique: Comparison, on a BioNLP task, of three sequence-labeling techniques (CRF, MEMM, SVMhmm) and one classifier-based approach (SVMmulticlass)
Purpose: To evaluate the performance of ML methods and identify the best features for automatic extraction of habitat entities; features used: orthographic, morphological, syntactic, and semantic
Domain: Bacterial Biotope entities
Dataset: Two BioNLP datasets, BB2013 and BB2016
Results: CRF and SVMhmm show comparable performance, but CRF achieves higher precision whereas SVMhmm has better recall; CRF and MEMM prove more robust than SVMhmm under poor feature conditions

[12] Technique: SML-based pi-CASTLE, a crowd-assisted IE system
Purpose: To store text annotations in a database and address the challenges of a probabilistic data model, selection of uncertain entities, and integration of human-provided entities
Domain: not stated
Dataset: For NER: CoNLL 2003 corpus; TwitterNLP dataset of 2,400 unstructured tweets
Results: pi-CASTLE achieves an optimal balance between cost, speed, and accuracy for IE problems

[13] Technique: Hybrid method to automatically generate rules
Purpose: To automatically extract and structure patient-related entities from large-scale data
Domain: Diagnosis extraction
Dataset: EHR clinical notes from 9.5M patient records
Results: Five use cases applied to demonstrate modularity, extensibility, scalability, and flexibility

[14] Technique: Unsupervised ML (clustering)
Purpose: To examine the impact of volume on three unsupervised ML methods (spectral, agglomerative, and K-means clustering)
Domain: Facebook posts
Dataset: 314,773 posts by companies and 1,427,178 posts by users of these companies
Results: P 40.7%, R 83.5%, F 56.3%; spectral clustering performed better on larger datasets

[15] Technique: Grammar rules + MapReduce
Purpose: To handle large amounts of data through parallelization; suitable for incomplete datasets
Domain: Free text
Dataset: Three text datasets with 1,293, 689, and 1,654 sentences, respectively
Results: Better recall on all three datasets but low precision
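The P%, R%, and F% results above are entity-level precision, recall, and F-measure. As a quick reference for how those three numbers relate, here is a minimal sketch; the counts used are purely illustrative and do not come from any of the surveyed papers:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from entity-level match counts:
    tp = correctly extracted entities, fp = spurious, fn = missed."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts: 42 correct, 8 spurious, 18 missed entities.
p, r, f = prf(tp=42, fp=8, fn=18)
print(round(p, 2), round(r, 2), round(f, 3))  # → 0.84 0.7 0.764
```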
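The models in [9]–[11] are all linear-chain sequence labelers (CRF, MEMM, or a neural encoder feeding a CRF layer); at prediction time each searches for the highest-scoring tag sequence with Viterbi decoding. A minimal pure-Python sketch of that decoding step, using an invented BIO tag set and toy log-probability tables (none of these numbers come from the surveyed papers):

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Best state path for obs under a linear-chain model.
    start[s], trans[s][s2], and emit[s][o] are log-probabilities."""
    # V[t][s] = best log-score of any path ending in state s at step t
    V = [{s: start[s] + emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + trans[p][s])
            col[s] = prev[best] + trans[best][s] + emit[s][o]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy BIO tagger over three tokens (tables are illustrative only).
lp = math.log
states = ["O", "B", "I"]
start = {"O": lp(0.6), "B": lp(0.35), "I": lp(0.05)}
trans = {"O": {"O": lp(0.7), "B": lp(0.25), "I": lp(0.05)},
         "B": {"O": lp(0.3), "B": lp(0.1), "I": lp(0.6)},
         "I": {"O": lp(0.4), "B": lp(0.1), "I": lp(0.5)}}
emit = {"O": {"the": lp(0.8), "lung": lp(0.1), "cancer": lp(0.1)},
        "B": {"the": lp(0.05), "lung": lp(0.8), "cancer": lp(0.15)},
        "I": {"the": lp(0.05), "lung": lp(0.15), "cancer": lp(0.8)}}
tags = viterbi(["the", "lung", "cancer"], states, start, trans, emit)
print(tags)  # → ['O', 'B', 'I']
```

The dynamic program keeps only the best-scoring predecessor per state at each step, so decoding is linear in sentence length rather than exponential in the number of tag sequences.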
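Of the three clustering methods compared in [14], K-means is the simplest: it alternates an assignment step (each point joins its nearest centroid) with an update step (each centroid moves to its cluster mean). A minimal one-dimensional sketch with fixed initial centroids; the data points are invented for illustration:

```python
def kmeans_1d(points, centroids, iters=10):
    """Lloyd's K-means on scalars; returns (centroids, assignments)."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        assign = []
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
            assign.append(i)
        # Update step: move each centroid to its cluster mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, assign

cents, labels = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
print(cents, labels)  # centroids ≈ [1.0, 9.5], labels [0, 0, 0, 1, 1, 1]
```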
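The approach in [15] parallelizes rule-based extraction with MapReduce: mappers apply grammar rules to each sentence independently and emit candidate entities as key–value pairs, and reducers aggregate the pairs by key. A minimal single-process sketch of that contract, using a naive capitalization rule; the rule and sentences are illustrative and not taken from the paper:

```python
import re
from collections import defaultdict
from itertools import chain

def map_phase(sentence):
    """Emit (candidate_entity, 1) for capitalized non-initial tokens."""
    tokens = sentence.split()
    for tok in tokens[1:]:  # skip sentence-initial capitalization
        if re.fullmatch(r"[A-Z][a-z]+", tok):
            yield tok, 1

def reduce_phase(pairs):
    """Sum counts per key, as a MapReduce reducer does per partition."""
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return dict(counts)

sentences = ["The results were reported by Smith in Boston",
             "A follow-up by Smith confirmed it"]
entities = reduce_phase(chain.from_iterable(map_phase(s) for s in sentences))
print(entities)  # → {'Smith': 2, 'Boston': 1}
```

Because each mapper sees only its own sentence, the map phase scales out trivially, which is the property [15] exploits for volume; the crude rule also shows how such systems can keep recall high while precision suffers.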