Optimized hybrid investigative based dimensionality reduction methods for malaria vector using KNN classifier

RNA-Seq data are utilized for biological applications and decision making for the classification of genes. A lot of works in recent time are focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed in the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset, to enhance the accuracy and scalability in the gene expression analysis. The proposed algorithm is used to fetch relevant features based on the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performances of the model are evaluated and validated using the classification accuracy to compare existing approaches in the literature. The achieved experimental results prove to be promising for selecting relevant genes and classifying pertinent gene expression data analysis by indicating that the approach is capable of adding to prevailing machine learning methods.

input space called the curse of dimensionality. Overcoming the curse of dimensionality challenges, several dimensionality reduction techniques have been exploited in literature [4]. Determining the optimal subset genes helpful for revealing hidden features of genes and enhance their interpretability is crucial [2]. The dimensionality reduction aim is to discover the trivial subset of genes that can help improve prediction performances, which will be helpful to clinicians in decision making and treatments [4].
Several authors have addressed the problems of the curse of dimensionality. Metaheuristics have also been proposed, yet approaches suffer from correlations, high throughputs, and increase in computational time for fetching gene subsets [5,6]. A systematic approach to fetching an optimal subset gene is a crucial issue.
Feature selection (filter, wrapper and embedded) [7][8][9] and feature extraction [10][11][12] (supervised and unsupervised) are dimensionality reduction approaches that have been established, these approaches have overcome several problems such as performance enhancement, yet there is need for improvements hybrid model and optimization for getting better results [13]. Finding an optimal subset of genes proficient at handling high dimension optimization difficulties with reasonable solutions is required [5].
Genetic algorithm (GA), is a feature selection technique; it is a wrapper-based which is represented by an optimization technique. GA is said to be adaptive heuristic search approach that finds an optimal subset of features in complex problems such as high dimensionality [14]. An associated problem with high dimensional data is overfitting "comprising of more model parameters" and curse of dimensionality "increase in basic error" [15]. To predict classification, a discriminatory genes subset is essential to avoid overfitting and to fit "correlate" irreplicable training sets. This helps in achieving predictive accuracy [13].
GA is proficient for finding optimal subsets on high dimensional data and have been used extensively, yet they are computationally expensive and prone to overfitting [14]. Overcoming this limitation, optimization strategies have been used to ensure better performances for finding optimum feature subsets and classification accuracy [14].
Principal component analysis (PCA) (non-linear) and Independent component analysis (ICA) (linear) are appropriate feature extraction methods that have been extensively used [10,15] are standard capable methods for fetching subset of gene samples for classification and have received growing attention in recent time [16]. The hybrid approach has proven to be significant, due to their excellent performances and advantages for solving dimensionality problems that halt classification, it is of the essence to come up with efficient models that are computationally fast and easy to implement for classification of gene expression data analysis [17].
This study proposes a hybrid dimensionality reduction model for the classification of malaria vector data. Based on the approaches, an optimized genetic algorithm (GA-O) is used to fetch out subset relevant genes. The PCA and ICA are used on the subset data, to fetch latent components in the data. Combining GA-O with PCA and GA-O with ICA, are classified using KNN on a Mosquito Anopheles gambiae dataset. This study proposes to improve the classification complexities such as the computational cost, fetching relevant subset genes and relationship among genes that can be used by clinicians for decision making.
The rest of this paper discusses the; Literature review, the methods and materials, result discussions and conclusions.

Literature review
A lot of dimensionality reduction and classification approaches have been explored in literature, they are based of several measures, such as computational complexity, accuracy, among others [24]. Some of the few works done have been studies in the literature and have indicated emerging research fields such as optimization and hybridization investigations [25].
A dimensionality reduction was proposed with class prediction approach for gene expression data, by suggesting an innovative procedure using feature extraction and feature selection, for gaining correlation of the reduced data and eliminating redundancy respectively. Their approach was tested and compared with the state of the art [26].
A distributed feature selection approach was proposed for classifying gene expression data, by trying to detect possible infected genes in a dispersed way that can help classify samples effectively. The huge data considered subdivided and distributed features among processors, a filter-based approach with fuzzy inference system were applied on the subset data. the result features were ranked and showed a better performance [27].
An optimizing feature selection and classifiers was proposed for gene expression data, by carrying out a hybrid model using mutual information with genetic algorithm on a high dimensional data, the reduced data is passed into partial least square using t-score mechanism. The reduced data is classified to attain an improved classification accuracy of about 93 %, yet calls for enhancement in terms of maximizing the accuracies and minimizing further, the features of genes [25].
A hybridized neuro-fuzzy with feature reduction approach was proposed for classification of data analysis, for dealing with uncertainty issues. Their result showed a considerable improvement in terms of accuracy and elimination of redundancy of information, and proposed solving real life gene expression classification problems [28].
A computational approach for integrating single cell data analysis was proposed, by describing the focus on joint analytical multimodal signals from respective cells, they proposed that years to come, studies of integrative multimodal single-cell data will pose a significant method and application and will be used extensively for characterizing innovative mechanisms [29].
A machine learning technique for malaria outbreak detection was proposed, by utilizing and comparing several techniques such as decision tree, gaussian and logistic regression methods. They found a binary classification question and outbreak results or no outbreak outcomes from the test data samples obtained from Indian Maharashtra. The results of comparable experiments are contrasted with the performance of the models. They were able to detect the samples based on the sample data used. Malaria outbreak in the testing dataset without any false positive or false negative errors [30].

Materials and methods
Datasets RNA-Seq for gene expression data analysis uses the mosquito Anopheles gambiae (Ag.) larvae, from Kenya western region. It comprises of deltamethrin susceptible and resistant mosquitoes' profile with considerate resistance devices; with 7 attributes relating to the Tests, Genes, Genes identities, Locus, Susceptible, Resistant, Status and a predictor from 2457 instances [31] (Table 1).

Methodology
RNA-Seq gene expression data is a widely utilized technology for diagnostic of several diseases, such as cancer, malaria, among others. It recognizes several aspects of transcriptomes, which is a principal existing technology for high-throughput genetic factors. Providing enhanced insight for transcriptome cells, alternative therapies and improved determinations [32], it identifies early secret variations occurring in conditions of disease by reacting to different environments and other training therapeutics, producing sufficient quantities of sequencing data [7,33]. Gene expression Classification of RNA-Seq data has provided valuable evidence to classify and assess German medications for diseases [34]. The expression of genes is genomic factors in the predominant method of RNA-Seq quantifying and gaining a better understanding of various biological tissues. The problem of diagnostic challenges is a significant challenge for RNA-Seq, and owing to the high dimensional gene data expression, it gives unfitting results.
In this study, the dataset uses a mosquito Anopheles gambiae. The samples of the genes are normalized using the MATLAB 2015 tool. The samples are passed into the optimized genetic algorithm. A reduced sample is then achieved and passed into the PCA and ICA separately. The further reduced data are split into training and testing sets. Classification is conducted using KNN.

Dimension reduction
A recognized technique for eliminating unwanted noise and unnecessary features is dimensionality reduction. Gene expression data comprises of high dimensional features that amount to computational weightiness, depriving the performance of classification models. Eliminating redundancy and obtaining irrelevant features that interrupt efficiency with activity by reducing the samples of feature ratios, dimensionality reduction procedures are essential. This method helps in reducing risks of overfitting. Reducing the dimensionality is a crucial method known as the collection of features and extraction of features [35,36].

Feature selection
Technologies such as RNA-seq transcriptomes, constructing relevant particular feature identifiers for sequences transcript is essential, to train and test models. Feature selection is essential to create a better classification performance. Selection of features allows choosing of suitable elements for in classification model performances by removing irrelevant and redundant features which minimize the curse of dimensionality. It helps to make the classification phase learning procedure successful and increases the success model. For example, extensive information feature selection process; RNA-Seq data involves supervised and unattended decision-making learning. For classification problems, rank characteristics conferring significance are essential, and selecting the best will advance the prediction model's performance. The collection of feature selection is an efficient technique identified as a filter, wrapper and embedded types [37].

Genetic algorithm (GA)
Genetic algorithm is a wrapper based evolutionary algorithm for selecting relevant features, used in investigating engine optimization problems. In the survival of the fittest base, GA is based on actual activities linked to human genetic factors. GA is made up of initial population development, fitness assessment, parent selection, crossover and mutation [38,39]. GA is an investigative discovery method, in a simple procedure, with a sample of randomly generated outcomes (phenotypes or entities) offering an acceptable value for the primary purpose of computing the beneficial results. Respective chromosomes or genotypes typically comprises of sets of properties categorized as binary strings of 0's and 1's [40]. While very sensitive to the initial population, GA has a weakness of optimality. Its result quality declines as problem dimensions rise, it has been shown to produce reasonable quality solutions to boost it for gene sampling.

Feature extraction
Extraction of features is a technique used in the identification of essential features, characteristics or features existing in data. Feature extraction technique examples are the identification of patterns and the detection of public instances in a set of identifications. Data with dimensional loads include the use of feature extraction, for producing a more precise explanation of characterizations. Feature extraction allows revolutionary selected feature variables to decrease the presence of the curse of dimensionality. There are two broad collections of feature extraction procedures, explicitly: non-linear (assumes data on low-dimensional subspace, such as PCA) and linear (assumes a lowdimensional subspace, characterized with a high-dimensional feature, such as ICA) for a non-linear relation between features [18].

Principal component analysis (PCA)
PCA is a method of non-linear feature extraction; it is commonly used primarily in genetic studies. Through reformatting the k-dimensional discrete features from exclusive n-dimension feature field, PCA projects feature spaces from high to lower dimensions. PCA has acknowledged that it is an essential method for the exploration of high-dimensional knowledge on gene expression. It is widely used for RNA-seq data. By transforming a set of correlated variables into a set of uncorrelated variables, investigating orthogonal alteration. PCA for the study of experimental results. PCA may be used to analyze the relationships between a set of variables and to minimize dimensionality [41].

Independent component analysis (ICA)
Disintegrating multivariate signs into independent non-gaussian for statistically independent components, ICA supports finding hidden features from multidimensional details. By decorrelating the data, ICA seeks a connection between information by manipulating or lessening the relevant data. As a linear combination of the independent components I, ICA adopts Opinion X. If B means columns of B define the separate weighted matrix R, the basis feature vectors of observation X.
For biological information, recognition and other reasons, ICA have been used extensively [42].
PCA is a non-linear alteration technique, used to minimize the dimension and number of features. It is a "non-linear" algorithm, while ICA is "linear, " if a data is preprocessed, ICA has been shown to perform better [18].

Classification
In data mining techniques, classification is a supervised learning method. It is a common, supportive task that gives and predicts class labels specified from the predefined class label to current data. The building of classification is comprised of two steps [43]: • The learning process, in which the classification model was developed with a class label giving a collection of training data. • The model predicts the class labels for concealed data and to calculates the accuracy of the KNN classifier.

K-Nearest Neighbor (KNN)
A supervised learning Kth nearest neighbour classification technique for gene datasets performs the benefit of creative application event assessment of neighbourhood classification. The KNN algorithm classifies creative entities based on examples, characteristics and training models. KNN classifiers do not train models to suit but are retention-based. The selected features are assumed to be inputs for segments. The K value of the closest neighbours is selected nearest to the spot of the question. Based on the minimum determined distance of Kth, detachment between query-instance and training models is taken into account and sorted. Group Y is taken from the closest neighbours. The unassuming prevalence of groups of nearest neighbours is used as the approximate number of instances of question. Bonds can fragment randomly [44]. Increasing the dimensionality of biological data is a significant problem for simple, predictable research methods. It is important to use traditional approaches for learning complex strategies on several layers moved by morphological processes interested in processing. Several complexities are involved in most typical procedures used to deal with high-dimensional data, such as the RNA-Seq data. The combination of different methods for reducing dimensionality will, in essence, take advantage of unique advantages where subset genes obtained from a procedure is supported as an input to (1) I = RtoX, X = BtoS the other. In general, feature extraction techniques support feature selection proficiently, by using feature selection to pick the original subset of genes, or by taking advantage of redundant gene elimination. Extracting primary subset features, combining various feature extraction methods can be useful [31,39,45,46].
An effective dimension reduction method to classify malaria vector data was suggested in this report.
RNA-Seq has tremendous potential for finding, defining and tracing cell lines. Still, the reduction of dimensionality helps to perceive the structures. Still, data remains challenging, and current algorithms need the correct development to reveal suitable characteristics, fusion approach proves to be healthy but necessitates effective procedures to model.
The classification technique proposed consists of three Phases, namely: • Selection of features.
• Extraction of features.
• Category of category. Figure 1 illustrates the projected hybrid system for classifying malaria gene expression dataset. The framework consists of three subsystems, a subsystem for feature collection, a class-based subsystem for feature extraction, and a subsystem for classification.
By adopting one algorithm below to pick an optimum subset by assessing the chromosome fitness, the function selection sub-system uses an optimized GA. The function extraction subsystem uses PCA and ICA because of its data projection of efficiency invariance along with impertinent orders. The standard of the researches is categorized using KNN.
Significances of genetic algorithm optimization are its evolutionary dispensation of the algorithm's features; it helps numerous search point which simultaneously and independently explores the optimal result to produce a good result. In this study, an optimization of the collection of genetic algorithm features to minimize numbers of features and Fig. 1 proposed framework for RNA-Seq gene classification maintain discriminant features. The extraction of features is ideal; it transforms reduced data to latent components; the productivity is to lessen prosperity and suffer from both methods of reduction of dimensionality used for classification of malaria.

Algorithm 1:
Experiments in this study are performed using Intel Core 5 with a 16 GB RAM, and 64-bit Operating system. All algorithms were coded in C + + on MATLAB 2015 environment platform.
The confusion matrices were used as the classification evaluation to certify comparable training and testing performances of the experiments in terms of accuracy, sensitivity, among other metrics [31] (Fig. 2).

Results and discussion
This study proposes a malaria vector dataset classification, using a public dataset, with 2457 samples and 7 features [31], on a MATLAB tool. The dataset was investigated using an optimized genetic algorithm to pick pertinent features in the data, using 0.5 thresholds, 708 significant subset features were selected. Classifier ability associated with the state-of-the-art was used for required evaluations.
The selected 708 features by the Optimized Genetic algorithm is first conceded into PCA algorithm with an extracted output of 10 latent variables in 1.4623 seconds. The results of the extracted features are classified using the KNN classification algorithm with 10-fold cross-validation "technique required in evaluating predictive models, it partitions the given sample into training and testing sets for evaluation". The KNN Confusion matrix was then evaluated using the performance metrics analysis.
The 708 selected features furthermore were conceded into the ICA algorithm and extracted 25 latent variables in 0.42794 seconds. The latent features were classified on KNN with 10-folds cross-validation, and the confusion matrix is evaluated.
GAO + PCA + KNN and GAO + ICA + KNN algorithms are carried out on the malaria vector dataset, and the performance evaluations of the experiments are tabulated below.
This study shows numerous significant suggestions for analyzing data gene expressions. The potential application of this experiment is to give relevant understanding into genetic and technical deliberations that can clarify revealed structures and elucidations for genes appropriate for predictions, analysis, detections of malaria infections, transmissions and drug designs (Fig. 3).

The GAO with PCA with K-NN results
The GAO with ICA with K-NN results Table 2, this study attained reliable performances with useful algorithms comparatively.

As stated in
This study proposed a hybrid dimension reduction approach using an optimized Genetic algorithm. PCA and ICA algorithms were used as on the selected features. KNN

Fig. 2
Proposed classification framework for the gene expression data analysis algorithm, using 10-fold cross-validation parameter, was used to classify the experiment. The result showed an enhanced result, as revealed in Table 2. Compared to the state-ofthe-art, the accuracies presented an improvement (Fig. 4). Providing a dependable discovery and prediction method for malaria infection and transmission, numerous investigators have studied underlying classification problems   using machine learning methods. Results achieved can be proposed to train prevalent malaria infections by clinicians, through the use of this procedure to compile curated diagnostic dataset to train classifiers and increase approaches for datasets to increase the dataset size significantly, concerning the overfitting difficulties related the training of datasets. The study of illustrating thousands of genes suggests unfathomable understanding into malaria classification complications with ample of data discoveries, for drug finding, prediction and diagnosis of malaria treatments as well as understanding roles of genes with the communication between the genes in daily and irregular situations. This study grew the classification performance results and demonstrated a less dependence training set (Table 3).

Conclusions
Data analysis of RNA-Seq offers valuable and significant benefits to the technology 's success, with tremendous helps to evolve the problems of gene expression profiling. RNA-Seq's related applications include the reduction of dimensionality and classification approaches. Due to the curse of dimensionality bound in the data of gene expression, it is a critical problem. Several strategies have been proposed to develop the technology, predict and detect diseases extracted from samples, and the reduction of dimensionality has proved to overcome these challenges. Yet, there is a need to undertake further inquiries. Recently, hybrid methods have also been used to classify gene expression results. GA + ICA + KNN outperformed the GA + PCA + KNN based method by performing a dimensionality reduction method using GA with ICA and GA with PCA algorithms discretely and evaluating their performance on KNN classification kernels. The purpose of this study is to present a method to reduce number of variables, while keeping informative ones for enhanced classification, which can be used by clinicians for decision making. This study proposed enhanced dimensionality and classification approach in stages using malaria gene expression data. relevant features retrieved to obtain a better performance measure.
Future work proposes to utilized hybrid dimensionality reduction procedures on other classifiers such as the deep learning, to identify the relevant classification of the gene expression data.

Acknowledgements
The author would like to thank Landmark University for supporting this work with all the needful experiments in this research.