Feature selection methods and genomic big data: a systematic review

Tadist, Khawla; Najah, Said; Nikolov, Nikola S.; Mrabti, Fatiha; Zahi, Azeddine

doi:10.1186/s40537-019-0241-0

Journal of Big Data

Table 5 Feature selection methods in big genomic data analytics


Refs	Application in genomics	Algorithm	Datasets	Evaluation methods	Technologies	Advantages	Disadvantages	Big data addressed: Vo	Va	Ve	Type of feature selection: F	W	Em	H	En	I
[17]	Sorting genomes	–	No datasets	–	–	–	–	–	–	Y	–	–	–	–	–	–
[18]	Classification	mRMR	Colorectal, liver, pancreatic, central nervous system (CNS), leukemia data	Cross-validation	–	Good classification accuracy	–	Y	Y	–	Y	–	–	–	–	–
[19]	Prediction	PFBP	Single nucleotide polymorphism (SNP) data	Bootstrapping	MapReduce	Reduces time complexity with better accuracy; parallelized	–	Y	Y	–	Y	–	–	–	–	–
[20]	Genetic trait prediction	MINT	Real data: maize, rice, and pine data	Cross-validation	–	Reduces time complexity	–	Y	Y	–	–	Y	–	–	–	–
[21]	Prediction	Boruta Random Forest	Next-generation sequencing laboratory of Novogene Bioinformatics Institute, Beijing, China	Bootstrapping	NetBeans	Good prediction accuracy	Small sample size	Y	Y	–	–	Y	–	–	–	–
[22]	Prediction	MIMIC FS	ASU datasets	Cross-validation	Weka	Good performance	–	Y	Y	–	–	Y	–	–	–	–
[23]	Marker selection	FIFS	Single nucleotide polymorphism (SNP)	Train and test	–	Huge rate of success	Not parallelized	Y	Y	–	–	Y	–	–	–	–
[24]	Binning for prediction	Random forest, Naïve Bayes	Generated datasets	Train and test	–	Dataset presents better prediction	–	Y	Y	–	–	–	Y	–	–	–
[25]	Classification, predicting disease	SVEGA	Breast cancer dataset, Kent Ridge Bio-medical Dataset Repository	TPR/FPR	–	Classification accuracy rate	Not parallelized	Y	Y	–	–	–	Y	–	–	–
[26]	Classification, prediction	SVM	Kent Ridge Bio-medical Dataset Repository and National Center for Biotechnology Information	ANOVA	Hadoop MapReduce	Good accuracy rate	–	Y	Y	–	–	–	Y	–	–	–
[27]	Classification, prediction	K-nearest neighbor	National Center for Biotechnology Information (NCBI) GEO	Cross-validation	Hadoop MapReduce	Reduces time complexity; parallelized	–	Y	Y	–	–	–	–	–	–	Y
[28]	Identification of gene expression signatures	SVM	20,475 features in 1920 samples, a high-dimensional dataset (source not mentioned)	Cross-validation	Weka	Better understanding of the classification	–	Y	Y	–	–	–	Y	–	–	–
[29]	Prediction	Cox-regression	The Cancer Genome Atlas datasets (glioblastoma and lung adenocarcinoma)	Cross-validation	–	Higher true variables rate; better predicting performance; easy to implement	–	Y	Y	–	–	–	Y	–	–	–
[30]	Prediction	mRMR IFS	Genome-wide association studies	Cross-validation	Weka	Good classification performance	Not parallelized	Y	Y	–	–	–	Y	–	–	–
[31]	Prediction	mRMR IFS	UniProt database http://www.uniprot.org	Cross-validation	Weka	High prediction accuracy	Not parallelized	Y	Y	–	–	–	Y	–	–	–
[32]	Classification	ROSEFW-RF	Generated with the ROS technique	Train and test	MapReduce	Parallelized; suitable for large-scale data	–	Y	Y	–	–	–	Y	–	–	–
[33]	Genetic association	Screening	GEO database with ID GSE13355 and GSE14905	Cross-validation	–	Good classification accuracy	–	Y	Y	–	–	–	–	Y	–	–
[34]	Classification	Decision Tree, Support Vector Machine	UCI machine learning repository http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html	Cross-validation	Weka	Simplicity of implementation; reduces time complexity; high accuracy	Expensive computational cost	Y	Y	–	–	–	–	Y	–	–
[35]	Prediction	Pearson Correlation Coefficient (PCC), Information Gain (IG), and ReliefF	E. coli, a prokaryotic model organism, as a real biological network	Fitness function	–	High speed and prediction accuracy; easily parallelizable	–	Y	Y	–	–	–	–	Y	–	–
[36]	Classification	SVM	Sentiment classification of online reviews using data collected from Amazon, IMDb, and Yelp; cancer classification based on gene expressions for leukemia, prostate cancer, and lung cancer	Hold-out validation	–	Simplicity and low error rates	Lack of scalability; not parallelized	Y	Y	–	–	–	–	Y	–	–
[37]	Classification	SVM	Breast cancer, colorectal adenocarcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, ovarian cancer http://www.cbio.mskcc.org/cancergenomics/pancan	Cross-validation	Weka	Optimal classification	–	Y	Y	–	–	–	Y	–	–	–
[38]	Clustering	Different clustering algorithms	Exome dataset of Brugada syndrome (BrS)	–	–	Suitable for high-dimensional genomic big data	No parallel implementation	Y	Y	–	–	–	–	–	Y	–
[39]	Classification, next-generation sequencing	SVM, Random Forest	NCBI Reference Sequence database, http://www.ncbi.nlm.nih.gov/refseq/	Cross-validation	Hadoop MapReduce	Scalable; high classification accuracy	–	Y	Y	–	–	–	–	–	Y	–
[40]	Identification of genetic markers, prediction	Sparse Regression	SNP: a database of single nucleotide polymorphisms http://www.alzgene.org	Cross-validation	–	Good accuracy for selection of features	Not always trivial	Y	Y	–	–	–	–	–	–	Y
[41]	Prediction, detecting SNP interactions	LogicFS-GPU	Simulated and real schizophrenia datasets	Cross-validation	MapReduce	Parallel design of the algorithm	Expensive computational cost	Y	Y	–	–	–	–	–	–	Y
[42]	Sequencing	PrefDiv and MGM PC-Stable	Pathway information database, Cancer Genome Atlas (TCGA)	Cross-validation	–	Combining two algorithms to enhance accuracy	–	Y	Y	–	–	–	–	–	–	Y
[43]	Prediction	Fireflies and ant colony	PDB Bank dataset Varibench Protein data Lung Cancer data bank Marketing	TPR/FPR	Matlab	High efficiency for feature selection	–	Y	Y	–	–	–	–	–	–	Y
[44]	Classification for prediction	ANOVA and K-Nearest Neighbor	NCBI GEO: leukemia, ovarian cancer, breast cancer	ANOVA	MapReduce	Distributed and scalable	–	Y	Y	–	–	–	–	–	–	Y
[45]	Classification for prediction	Decision tree, k-nearest neighbor	Brugada syndrome at Centre for Medical Genetics http://www.uzbrussel.be	Cross-validation	Weka	Good prediction accuracy; good with heterogeneous data	–	Y	Y	–	–	–	–	–	–	Y
[46]	Classification	–	Real-life biomedical data, SNP repository data; mixture models simulation studies	Cross-validation	MapReduce	High classification performance; parallelized	–	Y	Y	–	–	–	–	–	–	Y
[47]	Classification	–	Graph datasets of protein 3D-structures	Cross-validation	MapReduce	Improves prediction accuracy	–	Y	Y	–	–	–	–	–	–	Y

Vo volume, Va variety, Ve velocity, F filter, W wrapper, Em embedded, H hybrid, En ensemble, I integrated
(–) indicates that the item is not defined in the referenced paper
(Y) indicates that the column heading applies to the referenced paper
