From: An analytical study of information extraction from unstructured and multidimensional big data
| Ref | Technique | Purpose | Domain | Dataset | Results |
|---|---|---|---|---|---|
| [9] | Self-training with CNN-LSTM-CRF | Improve the performance and accuracy of NER for large-scale unlabeled clinical documents | Medical (clinical text) | Records of 19,378 patients | P: 84.2%, R: 85.5%, F: 84.4% |
| [10] | CNN + Bi-LSTM + CRF | Improve NE extraction on large online medical diagnosis (OMD) data with complex structure, without manual rules or features, i.e. to deal with volume and variety | Online medical diagnosis text (EMR) | Untagged corpus of 320,000 Q&A records from an online Q&A website | Trained on 1/3, 2/3, and all of the data, yielding F-measures of 87.26%, 88.79%, and 90.31%, respectively |
| [11] | Comparison of three sequence-labeling techniques (CRF, MEMM, SVMhmm) on a BioNLP task, using one classifier (SVMmulticlass) | Evaluate the performance of ML methods and identify the best features for automatic extraction of habitat entities; features used: orthographic, morphological, syntactic, semantic | Bacterial biotope entities | Two BioNLP datasets: BB2013, BB2016 | CRF and SVMhmm perform comparably, but CRF achieves higher precision whereas SVMhmm has better recall; CRF and MEMM are shown to be more robust than SVMhmm under poor feature conditions |
| [12] | SML-based pi-CASTLE: crowd-assisted IE system | Store text annotations in a database; addresses the challenges of a probabilistic data model, selection of uncertain entities, and integration of human annotations | — | For NER: CoNLL 2003 corpus; TwitterNLP dataset with 2,400 unstructured tweets | pi-CASTLE achieves an optimal balance between cost, speed, and accuracy for IE problems |
| [13] | Hybrid method that automatically generates rules | Automatically extract and structure patient-related entities from large-scale data | Diagnosis extraction | EHR clinical notes of 9.5M patient records | Five use cases applied to demonstrate modularity, extensibility, scalability, and flexibility |
| [14] | Unsupervised ML (clustering) | Examine the impact of volume on three unsupervised ML methods (spectral, agglomerative, and K-Means clustering) | Facebook posts | 314,773 posts by companies and 1,427,178 posts by users of these companies | P: 40.7%, R: 83.5%, F: 56.3%; spectral clustering performed better on larger datasets |
| [15] | Grammar rules + MapReduce | Handle a large amount of data through parallelization; suitable for incomplete datasets | Free text | Three text datasets with 1,293, 689, and 1,654 sentences, respectively | Better recall on all three text datasets, but low precision |
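The CRF-based sequence labelers in rows [9], [10], and [11] all share the same inference step: Viterbi decoding over label transition and emission scores. The following is a minimal pure-Python sketch of that step; the states, tokens, and all weights are hand-set illustrative values (not learned CRF parameters from any of the surveyed systems).

```python
# Minimal sketch of Viterbi decoding, the inference step shared by the
# CRF-based sequence labelers in rows [9]-[11]. All scores are
# illustrative hand-set log-weights, not learned parameters.

def viterbi(obs, states, start, trans, emit):
    """Return the highest-scoring label sequence for the tokens in obs."""
    # best[i][s]: score of the best path ending in state s at position i
    best = [{s: start[s] + emit[s][obs[0]] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (best[i - 1][p] + trans[p][s] + emit[s][obs[i]], p)
                for p in states
            )
            best[i][s] = score
            back[i][s] = prev
    # Backtrack from the best final state to recover the label sequence.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy BIO tagging example with hypothetical clinical-style tokens.
states = ["O", "B", "I"]
start = {"O": -1.0, "B": -1.0, "I": -9.0}
trans = {
    "O": {"O": -1.0, "B": -1.0, "I": -9.0},
    "B": {"O": -1.0, "B": -2.0, "I": -1.0},
    "I": {"O": -1.0, "B": -2.0, "I": -1.0},
}
emit = {
    "O": {"fever": -3.0, "and": -0.5, "cough": -3.0},
    "B": {"fever": -0.5, "and": -3.0, "cough": -0.5},
    "I": {"fever": -2.0, "and": -3.0, "cough": -2.0},
}
tags = viterbi(["fever", "and", "cough"], states, start, trans, emit)
# With these weights, "fever" and "cough" decode as entity starts: ["B", "O", "B"]
```

In a full CRF the transition and emission scores come from trained feature weights; the decoding itself is unchanged.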
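Row [15] scales rule-based extraction by casting it as a MapReduce job. A minimal sketch of that pattern is below; the "grammar rule" is a trivial stand-in (capitalized tokens as candidate entities) rather than the rules from the paper, and the map and reduce phases run sequentially here, whereas in a real deployment each map call would run on a separate worker.

```python
# Sketch of the MapReduce pattern from row [15] for parallelizing
# rule-based extraction over free text. The extraction "rule" is a
# hypothetical stand-in: treat capitalized tokens as candidate entities.
from collections import Counter
from functools import reduce

def map_phase(sentence):
    """Map: apply the extraction rule to one sentence, emitting entity counts."""
    return Counter(tok for tok in sentence.split() if tok.istitle())

def reduce_phase(acc, partial):
    """Reduce: merge partial counts produced by the map phase."""
    return acc + partial

sentences = [
    "Alice visited Paris last spring",
    "Bob met Alice in Paris",
    "the museum was closed",
]
# Each map_phase call is independent, so the map step parallelizes trivially.
entities = reduce(reduce_phase, map(map_phase, sentences), Counter())
# entities counts "Alice" and "Paris" twice each, "Bob" once
```

Because the reduce step only merges counters, it tolerates missing or partial sentences, which matches the paper's claim of suitability for incomplete datasets.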