From: Feature selection methods and genomic big data: a systematic review
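Refs | App in genomics | Algorithm | Datasets | Evaluation methods | Technologies | Advantages | Disadvantages | Big data addressed: Vo (volume) | Big data addressed: Va (variety) | Big data addressed: Ve (velocity) | Type of feature selection: F (filter) | Type of feature selection: W (wrapper) | Type of feature selection: Em (embedded) | Type of feature selection: H (hybrid) | Type of feature selection: En (ensemble) | Type of feature selection: I (integrative) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|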
[17] | Sorting genomes | – | No datasets | – | – | – | – | – | – | Y | – | – | – | – | – | – |
[18] | Classification | mRMR | Colorectal, liver, pancreatic, central nervous system (CNS), leukemia data | Cross-validation | – | Good classification accuracy | – | Y | Y | – | Y | – | – | – | – | – |
[19] | Prediction | PFBP | Single nucleotide polymorphism (SNP) data | Bootstrapping | MapReduce | Reduces time complexity with better accuracy; parallelized | – | Y | Y | – | Y | – | – | – | – | – |
[20] | Genetic trait prediction | MINT | Real data: maize, rice, and pine data | Cross-validation | – | Reduces time complexity | – | Y | Y | – | – | Y | – | – | – | – |
[21] | Prediction | Boruta Random Forest | Next-generation sequencing laboratory of Novogene Bioinformatics Institute, Beijing, China | Bootstrapping | NetBeans | Good prediction accuracy | Small sample size | Y | Y | – | – | Y | – | – | – | – |
[22] | Prediction | MIMIC FS | ASU datasets | Cross-validation | Weka | Good performance | – | Y | Y | – | – | Y | – | – | – | – |
[23] | Marker selection | FIFS | Single nucleotide polymorphism (SNP) | Train and test | – | High success rate | Not parallelized | Y | Y | – | – | Y | – | – | – | – |
[24] | Binning for prediction | Random forest, Naïve Bayes | Generated datasets | Train and test | – | Dataset presents better prediction | – | Y | Y | – | – | – | Y | – | – | – |
[25] | Classification, predicting disease | SVEGA | Breast cancer dataset, Kent Ridge Biomedical Repository | TPR/FPR | – | Classification accuracy rate | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
[26] | Classification, prediction | SVM | Kent Ridge Bio-medical Dataset Repository and National Center for Biotechnology Information | ANOVA | Hadoop MapReduce | Good accuracy rate | – | Y | Y | – | – | – | Y | – | – | – |
[27] | Classification, prediction | K-nearest neighbor | National Center for Biotechnology Information (NCBI) GEO | Cross-validation | Hadoop MapReduce | Reduces time complexity; parallelized | – | Y | Y | – | – | – | – | – | – | Y |
[28] | Identification of gene expression signatures | SVM | 20,475 features in 1920 samples, a high-dimensional dataset (source not mentioned) | Cross-validation | Weka | Better understanding of the classification | – | Y | Y | – | – | – | Y | – | – | – |
[29] | Prediction | Cox-regression | The Cancer Genome Atlas datasets (glioblastoma and lung adenocarcinoma) | Cross-validation | – | Higher true variables rate; better predicting performance; easy to implement | – | Y | Y | – | – | – | Y | – | – | – |
[30] | Prediction | mRMR, IFS | Genome-wide association studies | Cross-validation | Weka | Good classification performance | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
[31] | Prediction | mRMR, IFS | UniProt database http://www.uniprot.org | Cross-validation | Weka | High prediction accuracy | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
[32] | Classification | ROSEFW-RF | Generated with the ROS technique | Train and test | MapReduce | Parallelized; suitable for large-scale data | – | Y | Y | – | – | – | Y | – | – | – |
[33] | Genetic association | Screening | GEO database with ID GSE13355 and GSE14905 | Cross-validation | – | Good classification accuracy | – | Y | Y | – | – | – | – | Y | – | – |
[34] | Classification | Decision tree, support vector machine | UCI machine learning repository http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html | Cross-validation | Weka | Simplicity of implementation; reduces time complexity; high accuracy | Expensive computational cost | Y | Y | – | – | – | – | Y | – | – |
[35] | Prediction | Pearson Correlation Coefficient (PCC), Information Gain (IG), and ReliefF | Prokaryotic model organism E. coli, used as a real biological network | Fitness function | – | High speed and prediction accuracy; easily parallelizable | – | Y | Y | – | – | – | – | Y | – | – |
[36] | Classification | SVM | Sentiment classification of online reviews using data collected from Amazon, IMDb, and Yelp; cancer classification based on gene expression for leukemia, prostate cancer, and lung cancer | Hold-out validation | – | Simplicity and low error rates | Lack of scalability; not parallelized | Y | Y | – | – | – | – | Y | – | – |
[37] | Classification | SVM | Breast cancer, colorectal adenocarcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, ovarian cancer http://www.cbio.mskcc.org/cancergenomics/pancan | Cross-validation | Weka | Optimal classification | – | Y | Y | – | – | – | Y | – | – | – |
[38] | Clustering | Different clustering algorithms | Exome dataset of Brugada syndrome (BrS) | – | – | Suitable for high-dimensional genomic big data | No parallel implementation | Y | Y | – | – | – | – | – | Y | – |
[39] | Classification, next-generation sequencing | SVM, Random Forest | NCBI Reference Sequence database, http://www.ncbi.nlm.nih.gov/refseq/ | Cross-validation | Hadoop MapReduce | Scalable; high classification accuracy | – | Y | Y | – | – | – | – | – | Y | – |
[40] | Identification of genetic markers, prediction | Sparse Regression | SNP: a database of single nucleotide polymorphisms http://www.alzgene.org | Cross-validation | – | Good accuracy for selection of features | Not always trivial | Y | Y | – | – | – | – | – | – | Y |
[41] | Prediction, detecting SNP interactions | LogicFS-GPU | Simulated and real schizophrenia data set | Cross-validation | MapReduce | Parallel design of the algorithm | Expensive computational cost | Y | Y | – | – | – | – | – | – | Y |
[42] | Sequencing | PrefDiv and MGM PC-Stable | Pathway information database, The Cancer Genome Atlas (TCGA) | Cross-validation | – | Combines two algorithms to enhance accuracy | – | Y | Y | – | – | – | – | – | – | Y |
[43] | Prediction | Fireflies and ant colony | PDB Bank dataset, Varibench Protein data, Lung Cancer data, Bank Marketing | TPR/FPR | Matlab | High efficiency for feature selection | – | Y | Y | – | – | – | – | – | – | Y |
[44] | Classification for prediction | ANOVA and K-Nearest Neighbor | NCBI GEO: leukemia, ovarian cancer, breast cancer | ANOVA | MapReduce | Distributed and scalable | – | Y | Y | – | – | – | – | – | – | Y |
[45] | Classification for prediction | Decision tree, k-nearest neighbor | Brugada syndrome at Centre for Medical Genetics http://www.uzbrussel.be | Cross-validation | Weka | Good prediction accuracy; good with heterogeneous data | – | Y | Y | – | – | – | – | – | – | Y |
[46] | Classification | – | Real-life biomedical data, SNP repository data; mixture models simulation studies | Cross-validation | MapReduce | High classification performance; parallelized | – | Y | Y | – | – | – | – | – | – | Y |
[47] | Classification | – | Graph datasets of protein 3D-structures | Cross-validation | MapReduce | Improves prediction accuracy | – | Y | Y | – | – | – | – | – | – | Y |
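
Most rows above pair a feature selection step with cross-validation as the evaluation method. The sketch below illustrates that pairing for a filter-type method, ranking features by absolute Pearson correlation with the class label (PCC appears in the table as one of the scoring criteria). It is a minimal illustration under stated assumptions, not the procedure of any reviewed study: the function names, the synthetic data, and the nearest-centroid classifier are all hypothetical stand-ins chosen to keep the example dependency-free.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_top_k(X, y, k):
    """Filter step: score each feature by |PCC| with the label, keep the top k indices."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Toy classifier (an assumption of this sketch): pick the closest class centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]

    def sq_dist(point, centre):
        return sum((p - q) ** 2 for p, q in zip(point, centre))

    preds = []
    for row in X_test:
        point = [row[j] for j in features]
        preds.append(min(centroids, key=lambda c: sq_dist(point, centroids[c])))
    return preds

def cross_validate(X, y, k_features, n_folds=5, seed=0):
    """k-fold CV; features are re-selected on each training fold to avoid leakage."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = select_top_k(X_tr, y_tr, k_features)
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in fold], feats)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)

def make_toy_data(n_samples=60, n_features=20, seed=1):
    """Synthetic data: only features 0 and 1 carry class signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 3.0 * label
        row[1] -= 3.0 * label
        X.append(row)
        y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    print("selected features:", sorted(select_top_k(X, y, 2)))
    print("cross-validated accuracy:", cross_validate(X, y, k_features=2))
```

Selecting features inside each training fold, rather than once on the full data, keeps the held-out samples from influencing the ranking; selecting on the full data first tends to inflate the reported accuracy.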