Table 5 Feature selection methods in big genomic data analytics

From: Feature selection methods and genomic big data: a systematic review

Refs App in genomics Algorithm Datasets Evaluation methods Technologies Advantages Disadvantages Big data addressed Type of feature selection
Vo Va Ve F W Em H En I
[17] Sorting genomes No datasets Y
[18] Classification mRMR Colorectal, liver, pancreatic, central nervous system (CNS), leukemia data Cross-validation Good classification accuracy Y Y Y
[19] Prediction PFBP Nucleotide polymorphism (SNP) data Bootstrapping MapReduce Reduces time complexity with better accuracy parallelized Y Y Y
[20] Genetic trait prediction MINT Real data: maize, rice, and pine data Cross-validation Reduces time complexity Y Y Y
[21] Prediction Boruta Random Forest Next-generation sequencing laboratory of Novogene Bioinformatics Institute, Beijing, China Bootstrapping NetBeans Good prediction accuracy Small sample size Y Y Y
[22] Prediction MIMIC FS ASU datasets Cross-validation Weka Good performance Y Y Y -
[23] Marker selection FIFS Single nucleotide polymorphism (SNP) Train and test Huge rate of success Not parallelized Y Y Y
[24] Binning for prediction Random forest Naïve Bayes Generated datasets Train and test Dataset presents better prediction Y Y Y
[25] Classification predicting disease SVEGA Breast cancer dataset, Kent Ridge biomedical repository TPR/FPR Classification accuracy rate Not parallelized Y Y - - - Y - - -
[26] Classification prediction SVM Kent Ridge Bio-medical Data Set Repository and National Center for Biotechnology Information ANOVA Hadoop Good accuracy rate Y Y Y
[27] Classification prediction K-nearest neighbor National Center for Biotechnology Information (NCBI) GEO Cross-validation Hadoop Reduces time complexity
[28] Identification of gene expression signatures SVM 20,475 features in 1920 samples, a high-dimensional dataset (source not mentioned) Cross-validation Weka Better understanding of the classification Y Y Y
[29] Prediction Cox-regression The Cancer Genome Atlas datasets (glioblastoma and lung adenocarcinoma) Cross-validation Higher true variables rate; better predicting performance; easy-to-implement property
[30] Prediction mRMR IFS Genome-wide association studies Cross-validation Weka Good classification performance Not parallelized Y Y Y
[31] Prediction mRMR IFS UniProt database Cross-validation Weka High prediction accuracy Not parallelized Y Y Y
[32] Classification ROSEFW-RF Generated with the ROS technique Train and test MapReduce Parallelized; suitable for large-scale data
[33] Genetic association Screening GEO database with ID GSE13355 and GSE14905 Cross-validation Good classification accuracy Y Y Y
[34] Classification Decision Tree, Support Vector Machine UCI machine learning repository Cross-validation Weka Simplicity of implementation; reduces time complexity; high accuracy Expensive computational cost Y Y Y
[35] Prediction Pearson Correlation Coefficient (PCC), Information Gain (IG) and ReliefF Prokaryotic model organism E. coli, as a real biological network Fitness function High speed and prediction accuracy; easily parallelizable
[36] Classification SVM Sentiment classification of online reviews using data collected from Amazon, IMDb, and Yelp; cancer classification based on gene expressions for leukemia, prostate cancer, and lung cancer Hold-out validation Simplicity and low error rates Lack of scalability; not parallelized
[37] Classification SVM Breast cancer, colorectal adenocarcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, ovarian cancer Cross-validation Weka Optimal classification Y Y Y
[38] Clustering Different clustering algorithms Exome dataset of Brugada syndrome (BrS) Suitable for high-dimensional genomic big data No parallel implementation Y Y Y
[39] Classification next generation sequencing SVM, Random Forest NCBI Reference Sequence database, http:// Cross-validation Hadoop High classification accuracy
[40] Identification of genetic markers prediction Sparse Regression SNP: a database of single nucleotide polymorphisms Cross-validation - Good accuracy for selection of features Not always trivial Y Y Y
[41] Prediction detecting SNP interactions LogicFS-GPU Simulated and real schizophrenia data set Cross-validation MapReduce Parallel design of the algorithm Expensive computational cost Y Y Y
[42] Sequencing PrefDiv and MGM PC-Stable Pathway information database Cancer Genome Atlas (TCGA) Cross-validation - Combining two algorithms to enhance accuracy Y Y Y
[43] Prediction Fireflies and ant colony PDB Bank dataset Varibench Protein data Lung Cancer data bank Marketing TPR/FPR Matlab High efficiency for feature selection Y Y Y
[44] Classification for prediction ANOVA and K-Nearest Neighbor NCBI GEO: leukemia, ovarian cancer, breast cancer ANOVA MapReduce Distributed and scalable Y Y Y
[45] Classification for prediction Decision tree, k-nearest-neighbor Brugada syndrome at Centre for Medical Genetics Cross-validation Weka Good prediction accuracy; good with heterogeneous data
[46] Classification Real-life biomedical data, SNP repository data; mixture models simulation studies Cross-validation MapReduce High classification performance
[47] Classification Graph datasets of protein 3D-structures Cross-validation MapReduce Improves prediction accuracy Y Y Y
  1. Vo volume, Va variety, Ve velocity, F filter, W wrapper, Em embedded, H hybrid, En ensemble, I integrated
  2. (–) Refers to a lack of definition in the referenced papers
  3. (Y) Indicates that the column heading applies to the referenced paper
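
The filter category (F) in the footnotes covers methods that score each feature with a statistic computed independently of any classifier and keep the top-ranked ones. As a rough illustration only, the sketch below ranks features by a one-way ANOVA F score (ANOVA appears in the table rows for [26] and [44]). The function names, the toy data, and the top-k rule are assumptions made for this example and are not taken from any of the reviewed papers.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features and all scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy expression matrix (hypothetical values): 6 samples x 3 features.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.8, 0.9],
    [0.9, 5.1, 0.4],
    [3.0, 5.0, 0.7],
    [3.1, 4.9, 0.1],
    [2.9, 5.2, 0.6],
]
y = [0, 0, 0, 1, 1, 1]  # two classes, e.g. control vs. case

selected, scores = select_top_k(X, y, k=1)
```

Here only the first feature separates the two classes, so it receives the largest F score. Because each feature is scored independently, the scoring loop could be split across workers, which is one reason filter methods are often paired with MapReduce-style platforms in the table above.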