Table 1 Overview of major feature selection algorithm approaches and their characteristics

From: A survey of dimension reduction and classification methods for RNA-Seq data on malaria vector

| Feature selection method | Algorithms | Characteristics | Benefits and limitations | Assessments |
| --- | --- | --- | --- | --- |
| Filter-based approaches | Correlation-based feature selection (CBFS) | Evaluates a feature subset by considering the individual predictive ability of each of its features together with their degree of redundancy (correlation) | Feature-dependent, but slower than univariate techniques | Heuristic merit |
|  | Mutual information | Evaluates the dependencies between features and classes; has been used to identify the most probable cancer-associated genes and to improve classification accuracy | Selected features can contribute redundancy to the classification [43] | Symmetric relationship |
|  | Analysis of variance (ANOVA) [44] | The dependent variable is continuous, the grouping variable is categorical (nominal or ordinal), and the data are assumed to be normally distributed | Gives an overall test of the equality of group means; tests against a specific hypothesis | Hypothesis test |
|  | Information gain [45] | Measures how much information a feature contributes about the predicted class, so features that frequently occur in positive samples can be identified | Its evaluation is based on entropy, which involves a substantial amount of mathematical theory and complex formulas | Ranking |
|  | Chi-square [46] | Evaluates the correlation between two variables and determines whether they are independent or correlated |  |  |
| Wrapper-based approaches | Genetic algorithm [43] | Mimics evolution by encoding candidate solutions as a population of strings and recombining them to produce fitter ones | Performs a randomized population search, but has a lower training time | Crossover and mutation |
|  | Recursive feature elimination (RFE) [47] | Backward selection of predictors: fits a model and repeatedly removes the weakest features | Partitioning the predictors is essential; ranks features by the order of their elimination and by multicollinearity | Greedy optimization |
| Embedded approaches | Info Gain-SVM [48] | Selects attributes and improves their correlation | Reduces the bias of plain information gain by adjusting each attribute to allow for the breadth and uniformity of its values | Wavelength |
|  | SVM-RFE [49] | Makes implicit orthogonality assumptions; considers a combination of univariate classifiers | The decision function is based only on support vectors that are "borderline" cases, rather than on all examples in an attempt to characterize the "typical" cases; lower risk of overfitting | Ranking criterion |
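
The filter scores in the table above (mutual information, the ANOVA F-test, and chi-square) are all available in scikit-learn. The sketch below is illustrative only: the synthetic matrix stands in for an RNA-Seq expression dataset and the choice of k = 10 is arbitrary, neither taken from the survey.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif, chi2

# Synthetic stand-in for an expression matrix: samples x genes.
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)
X = np.abs(X)  # chi-square requires non-negative inputs, like raw read counts

for name, score_func in [("mutual information", mutual_info_classif),
                         ("ANOVA F-test", f_classif),
                         ("chi-square", chi2)]:
    selector = SelectKBest(score_func=score_func, k=10)  # keep the 10 best genes
    selector.fit(X, y)
    print(name, "->", np.flatnonzero(selector.get_support()))
```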
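Recursive feature elimination, the wrapper approach in the table, can be sketched the same way: fit a model, drop the weakest predictor, and refit until the desired number of features remains. The logistic-regression estimator and the step size here are illustrative assumptions, not choices made in the survey.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# Backward selection: refit and drop the weakest feature until 10 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=10, step=1)
rfe.fit(X, y)
print("ranking (1 = retained):", rfe.ranking_)
```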
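SVM-RFE differs only in the ranking criterion: a linear SVM is the estimator, and features are eliminated in order of the magnitude of their weights in the decision function. Again, the data and parameters below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# A linear SVM exposes per-feature weights (coef_); RFE removes the
# feature with the smallest weight at each round.
svm_rfe = RFE(estimator=SVC(kernel="linear", C=1.0),
              n_features_to_select=10, step=1)
svm_rfe.fit(X, y)
print("selected features:", svm_rfe.support_.nonzero()[0])
```

Setting step to a value greater than 1 removes several features per round, trading some ranking resolution for far fewer model fits, which matters on high-dimensional expression data.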