Table 4 Named entity recognition

From: An analytical study of information extraction from unstructured and multidimensional big data











Self-training with CNN-LSTM-CRF

To improve the performance and accuracy of NER on large-scale unlabeled clinical documents

Medical (clinical text)

19,378 patient data






To improve NE extraction performance on large, structurally complex OMD data without manual rules or features, i.e., to deal with volume and variety

Online medical diagnosis text (EMR)

Untagged corpus of 320,000 Q&A records from an online Q&A website

Trained on 1/3, 2/3, and all of the data to compare performance, yielding F-measures of 87.26%, 88.79%, and 90.31%, respectively


Comparison of three sequence labeling techniques (CRF, MEMM, SVMhmm) on a BioNLP task, using one classifier (SVMmulticlass)

To evaluate the performance of ML methods and to identify the best features for automatic extraction of habitat entities

Features Used: orthographic, morphological, syntactic, semantic

Bacterial Biotope entities

Two BioNLP datasets: BB2013 and BB2016

CRFs and SVMhmm have comparable performance, but CRFs achieve higher precision whereas SVMhmm has better recall

CRFs and MEMM are shown to be more robust than SVMhmm under poor feature conditions


SML-based pi-CASTLE: crowd-assisted IE system

To store text annotations in a database and address the challenges of the probabilistic data model, selection of uncertain entities, and integration of human entities


For NER: CoNLL 2003 corpus and TwitterNLP dataset with 2,400 unstructured tweets

pi-CASTLE achieves an optimal balance between cost, speed, and accuracy for IE problems


Hybrid method to automatically generate rules

To automatically extract and structure patient-related entities from large-scale data

Diagnosis extraction

EHR clinical notes of 9.5M patient records

Five use cases applied to demonstrate modularity, extensibility, scalability, and flexibility


Unsupervised ML (clustering)

To examine the impact of data volume on three unsupervised ML methods (spectral, agglomerative, and K-Means clustering)

Facebook posts

314,773 posts by companies and 1,427,178 posts by users for these companies




Spectral clustering performed better on larger datasets


Grammar rules + MapReduce

To handle large amounts of data through parallelization

Suitable for incomplete datasets

Free text

Three text datasets with 1,293, 689, and 1,654 sentences, respectively

The results show better recall but low precision on the three text datasets