Table 5 Feature selection methods in big genomic data analytics

From: Feature selection methods and genomic big data: a systematic review

Column groups: Big data addressed (Vo, Va, Ve); Type of feature selection (F, W, Em, H, En, I).

| Refs | App in genomics | Algorithm | Datasets | Evaluation methods | Technologies | Advantages | Disadvantages | Vo | Va | Ve | F | W | Em | H | En | I |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [17] | Sorting genomes | – | No datasets | – | – | – | – | – | – | Y | – | – | – | – | – | – |
| [18] | Classification | mRMR | Colorectal, liver, pancreatic, central nervous system (CNS), leukemia data | Cross-validation | – | Good classification accuracy | – | Y | Y | – | Y | – | – | – | – | – |
| [19] | Prediction | PFBP | Single nucleotide polymorphism (SNP) data | Bootstrapping | MapReduce | Reduces time complexity with better accuracy; parallelized | – | Y | Y | – | Y | – | – | – | – | – |
| [20] | Genetic trait prediction | MINT | Real data: maize, rice, and pine data | Cross-validation | – | Reduces time complexity | – | Y | Y | – | – | Y | – | – | – | – |
| [21] | Prediction | Boruta Random Forest | Next-generation sequencing laboratory of Novogene Bioinformatics Institute, Beijing, China | Bootstrapping | NetBeans | Good prediction accuracy | Small sample size | Y | Y | – | – | Y | – | – | – | – |
| [22] | Prediction | MIMIC FS | ASU datasets | Cross-validation | Weka | Good performance | – | Y | Y | – | – | Y | – | – | – | – |
| [23] | Marker selection | FIFS | Single nucleotide polymorphism (SNP) | Train and test | – | High success rate | Not parallelized | Y | Y | – | – | Y | – | – | – | – |
| [24] | Binning for prediction | Random forest, Naïve Bayes | Generated datasets | Train and test | – | Dataset presents better prediction | – | Y | Y | – | – | – | Y | – | – | – |
| [25] | Classification, predicting disease | SVEGA | Breast cancer dataset, Kent Ridge Biomedical Repository | TPR/FPR | – | Classification accuracy rate | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
| [26] | Classification, prediction | SVM | Kent Ridge Bio-medical Dataset Repository and National Center for Biotechnology Information | ANOVA | Hadoop MapReduce | Good accuracy rate | – | Y | Y | – | – | – | Y | – | – | – |
| [27] | Classification, prediction | K-nearest neighbor | National Center for Biotechnology Information (NCBI) GEO | Cross-validation | Hadoop MapReduce | Reduces time complexity; parallelized | – | Y | Y | – | – | – | – | – | – | Y |
| [28] | Identification of gene expression signatures | SVM | 20,475 features in 1920 samples, a high-dimensional dataset (source not mentioned) | Cross-validation | Weka | Better understanding of the classification | – | Y | Y | – | – | – | Y | – | – | – |
| [29] | Prediction | Cox regression | The Cancer Genome Atlas datasets (glioblastoma and lung adenocarcinoma) | Cross-validation | – | Higher true variables rate; better predicting performance; easy to implement | – | Y | Y | – | – | – | Y | – | – | – |
| [30] | Prediction | mRMR, IFS | Genome-wide association studies | Cross-validation | Weka | Good classification performance | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
| [31] | Prediction | mRMR, IFS | UniProt database http://www.uniprot.org | Cross-validation | Weka | High prediction accuracy | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
| [32] | Classification | ROSEFW-RF | Generated with the ROS technique | Train and test | MapReduce | Parallelized; suitable for large-scale data | – | Y | Y | – | – | – | Y | – | – | – |
| [33] | Genetic association | Screening | GEO database, IDs GSE13355 and GSE14905 | Cross-validation | – | Good classification accuracy | – | Y | Y | – | – | – | – | Y | – | – |
| [34] | Classification | Decision tree, support vector machine | UCI machine learning repository http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html | Cross-validation | Weka | Simplicity of implementation; reduces time complexity; high accuracy | Expensive computational cost | Y | Y | – | – | – | – | Y | – | – |
| [35] | Prediction | Pearson correlation coefficient (PCC), information gain (IG), and ReliefF | Prokaryotic model organism E. coli, as a real biological network | Fitness function | – | High speed and prediction accuracy; easily parallelizable | – | Y | Y | – | – | – | – | Y | – | – |
| [36] | Classification | SVM | Sentiment classification of online reviews using data collected from Amazon, IMDb, and Yelp. Cancer classification based on gene expressions for leukemia, prostate cancer, and lung cancer | Hold-out validation | – | Simplicity and low error rates | Lack of scalability; not parallelized | Y | Y | – | – | – | – | Y | – | – |
| [37] | Classification | SVM | Breast cancer, colorectal adenocarcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, ovarian cancer http://www.cbio.mskcc.org/cancergenomics/pancan | Cross-validation | Weka | Optimal classification | – | Y | Y | – | – | – | Y | – | – | – |
| [38] | Clustering | Different clustering algorithms | Exome dataset of Brugada syndrome (BrS) | – | – | Suitable for high-dimensional genomic big data | No parallel implementation | Y | Y | – | – | – | – | – | Y | – |
| [39] | Classification, next-generation sequencing | SVM, random forest | NCBI Reference Sequence database, http://www.ncbi.nlm.nih.gov/refseq/ | Cross-validation | Hadoop MapReduce | Scalable; high classification accuracy | – | Y | Y | – | – | – | – | – | Y | – |
| [40] | Identification of genetic markers, prediction | Sparse regression | SNP: a database of single nucleotide polymorphisms http://www.alzgene.org | Cross-validation | – | Good accuracy for selection of features | Not always trivial | Y | Y | – | – | – | – | – | – | Y |
| [41] | Prediction, detecting SNP interactions | LogicFS-GPU | Simulated and real schizophrenia datasets | Cross-validation | MapReduce | Parallel design of the algorithm | Expensive computational cost | Y | Y | – | – | – | – | – | – | Y |
| [42] | Sequencing | PrefDiv and MGM PC-Stable | Pathway information database; Cancer Genome Atlas (TCGA) | Cross-validation | – | Combines two algorithms to enhance accuracy | – | Y | Y | – | – | – | – | – | – | Y |
| [43] | Prediction | Fireflies and ant colony | PDB Bank dataset Varibench Protein data Lung Cancer data bank Marketing | TPR/FPR | MATLAB | High efficiency for feature selection | – | Y | Y | – | – | – | – | – | – | Y |
| [44] | Classification for prediction | ANOVA and k-nearest neighbor | NCBI GEO: leukemia, ovarian cancer, breast cancer | ANOVA | MapReduce | Distributed and scalable | – | Y | Y | – | – | – | – | – | – | Y |
| [45] | Classification for prediction | Decision tree, k-nearest neighbor | Brugada syndrome at Centre for Medical Genetics http://www.uzbrussel.be | Cross-validation | Weka | Good prediction accuracy; good with heterogeneous data | – | Y | Y | – | – | – | – | – | – | Y |
| [46] | Classification | – | Real-life biomedical data, SNP repository data; mixture models simulation studies | Cross-validation | MapReduce | High classification performance; parallelized | – | Y | Y | – | – | – | – | – | – | Y |
| [47] | Classification | – | Graph datasets of protein 3D structures | Cross-validation | MapReduce | Improves prediction accuracy | – | Y | Y | – | – | – | – | – | – | Y |

  1. Vo: volume, Va: variety, Ve: velocity, F: filter, W: wrapper, Em: embedded, H: hybrid, En: ensemble, I: integrated
  2. (–) indicates that the item is not specified in the referenced paper
  3. (Y) indicates that the column heading applies to the referenced paper
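
The legend separates filter (F) methods from wrapper, embedded, and other types. As a rough illustration of the filter idea only, where each feature is scored on its own without training a classifier, the sketch below ranks features by the absolute Pearson correlation coefficient with the class label and keeps the top `k`. The data are made-up toy values, and the names `pearson`, `filter_select`, `toy_rows`, and `toy_labels` are hypothetical; the sketch does not reproduce any algorithm from the reviewed papers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def filter_select(rows, labels, k):
    """Return the indices of the k features with the largest |PCC| vs. the labels."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy matrix (hypothetical values): 6 samples x 4 features, binary labels.
toy_rows = [
    [0.1, 5.0, 1.0, 0.9],
    [0.2, 4.0, 1.0, 0.8],
    [0.3, 6.0, 1.0, 0.1],
    [0.9, 5.5, 1.0, 0.2],
    [1.0, 4.5, 1.0, 0.7],
    [1.1, 5.2, 1.0, 0.3],
]
toy_labels = [0, 0, 0, 1, 1, 1]

selected = filter_select(toy_rows, toy_labels, k=2)
print(selected)
```

A wrapper method would instead score candidate feature subsets by training and evaluating a classifier on each, which is typically more expensive than the per-feature scoring shown here.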