Optimized Hybrid Heuristic Based Dimensionality Reduction Methods for Malaria Vector Using KNN Classifier

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and several dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GA-O-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A KNN classifier is applied to the reduced Anopheles gambiae mosquito dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space, using a fast feature-ranking algorithm to select them. The performance of the model is evaluated and validated using classification accuracy, and compared with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.


Introduction
A major problem in the bioinformatics field is the selection of genes from high-throughput biological data. Gene expression data are known for having few samples and many irrelevant, redundant, and noisy genes, which degrade the performance of classification learning models. Dimensionality reduction techniques have been widely used to fetch relevant discriminative subsets from gene expression data; they also reduce the computational burden and improve classification prediction accuracy (Pashaei et al. 2019).
In gene expression data analysis, overfitting and the curse of dimensionality are known to degrade classification capability. The curse of dimensionality arises from the high-dimensional input space. To overcome it, several dimensionality reduction techniques have been exploited in the literature. Determining an optimal subset of genes that helps reveal hidden features of genes and enhances their interpretability is a major challenge. The aim of dimensionality reduction is to discover a small subset of genes that can improve prediction performance, which will help clinicians in decision making and treatment (Shukla et al. 2019).
Several authors have addressed the curse of dimensionality. Metaheuristics have also been proposed, yet these approaches suffer from correlated features, high throughput, and increased computational time for fetching gene subsets (Cai et al. 2018; Marfaja & Mirjalili, 2018). A systematic approach to fetching an optimal gene subset therefore remains a crucial issue: what is required is a method that can find an optimal subset of genes and handle high-dimensional optimization difficulties with reasonable solutions.
The genetic algorithm (GA) is a wrapper-based feature selection technique built on an optimization procedure. GA is an adaptive heuristic search approach that finds an optimal subset of features in complex problems such as high-dimensional data (Chiesa et al. 2020). GAs are proficient at finding optimal subsets in high-dimensional data and have been used extensively, yet they are computationally expensive and prone to overfitting. Classification of RNA-Seq data has provided valuable evidence for classifying diseases and assessing relevant medications for them. Quantifying gene expression with RNA-Seq is the predominant method for gaining a better understanding of various biological tissues. Diagnosis remains a major challenge for RNA-Seq, however, because the high dimensionality of gene expression data leads to poorly fitting results.
This study uses a dataset of the mosquito Anopheles gambiae. The gene samples are normalized using MATLAB and passed into the optimized genetic algorithm. The resulting reduced data are then passed into PCA and ICA separately. The further reduced data are split into training and testing sets, with the testing set made up of the samples left over after the training set is drawn. Classification is conducted using KNN.

Dimension Reduction
Dimensionality reduction is a recognized technique for eliminating unwanted noise and unnecessary features. Gene expression data comprise high-dimensional features that add computational burden and degrade the performance of classification models. Dimensionality reduction procedures are essential for eliminating redundant and irrelevant features that hamper efficiency, which they do by reducing the ratio of features to samples. This also helps reduce the risk of overfitting.

Feature Selection
With technologies such as RNA-Seq transcriptomics, constructing relevant feature identifiers for transcript sequences is essential for training and testing models. Feature selection is important for achieving better classification performance. It allows suitable features to be chosen for the classification model by removing irrelevant and redundant ones, which mitigates the curse of dimensionality. It helps make the learning procedure of the classification phase successful and improves the resulting model. Feature selection for RNA-Seq data involves both supervised and unsupervised learning. For classification problems, ranking features by significance is important, and selecting the best-ranked features improves the prediction model's performance. One efficient family of feature selection techniques is known as filters.

Principal Component Analysis (PCA)
PCA is a method of linear feature extraction that is commonly used in genetic studies. By constructing k-dimensional features from the original n-dimensional feature space, PCA projects the feature space from a higher to a lower dimension. PCA is recognized as an important method for exploring high-dimensional gene expression data and is widely used for RNA-Seq data and for the study of experimental results. It uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA may be used to analyze the relationships between a set of variables and to reduce dimensionality (Jain & Singh, 2018).
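The PCA projection described above can be written compactly as follows. This is a standard textbook formulation rather than the exact procedure used in this study; the sample count m and the mean-centring of X are assumptions introduced here, while n and k are the original and reduced dimensions mentioned in the text.

```latex
% Assumed notation: X is the mean-centred data matrix with m samples (rows)
% and n features (columns); k <= n is the number of retained components.
\begin{align}
C &= \frac{1}{m-1} X^{\top} X, \\
C &= V \Lambda V^{\top}, \qquad
     \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n), \quad
     \lambda_1 \ge \dots \ge \lambda_n \ge 0, \\
Z &= X V_k, \qquad V_k = [\, v_1, \dots, v_k \,].
\end{align}
```

The columns of V are orthonormal eigenvectors of the covariance matrix C, so the k columns of Z are uncorrelated and capture the k largest variances in the data.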

Independent Component Analysis (ICA)
By decomposing multivariate signals into statistically independent non-Gaussian components, ICA helps find hidden features in multidimensional data. Beyond decorrelating the data, ICA seeks structure in the information by maximizing or minimizing a suitable criterion. ICA assumes that the observation X is a linear combination of the independent components I. If B denotes the inverse of the separating weight matrix R, then the columns of B are the basis feature vectors of the observation X. PCA is likewise a linear transformation technique, used to reduce the dimension and the number of features. Both methods are linear: PCA only removes correlations between features, whereas ICA looks for statistically independent components. When the data are preprocessed, ICA has been shown to perform better (Hira & Gillies, 2015).
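Under one reading of the description above, the ICA model can be stated in two lines. The symbols follow the text (X, I, B, R); the hat on the estimated components is notation added here for illustration.

```latex
% Symbols follow the text: X observation, I independent components
% (not an identity matrix), B basis (mixing) matrix, R separating weight matrix.
\begin{align}
X &= B\,I, \\
\hat{I} &= R\,X, \qquad R = B^{-1}.
\end{align}
```

In practice R is estimated from the data, and the components can only be recovered up to their order and scale.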

Classification
In data mining, classification is a supervised learning method. It is a common and useful task that assigns class labels from a predefined set to new data. Building a classifier comprises two steps (Arowolo et al. 2017). The first is the learning step, in which the classification model is built from a collection of training data with known class labels. In the second, the model predicts the class labels of unseen data, and the accuracy of the KNN classifier is calculated.

K-Nearest Neighbor (KNN)
K-nearest neighbour (KNN) is a supervised learning technique that classifies new instances in gene datasets by assessing their neighbourhood. The KNN algorithm classifies new objects based on examples, attributes, and training samples. KNN classifiers do not fit a model; they are memory-based. The selected features serve as inputs to the classifier. Given a query point, the K training samples closest to it are selected. The distances between the query instance and the training samples are computed and sorted, and the nearest neighbours are determined by the K-th minimum distance. The class labels Y of these nearest neighbours are gathered, and the simple majority class among them is used as the prediction for the query instance. Ties can be broken at random (Bose, 2016).
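The KNN procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB implementation used in the study; the Euclidean distance, the function name `knn_predict`, and the toy data are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k=3, rng=None):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    rng = rng or random.Random(0)
    # Distance from the query to every training sample (Euclidean is an assumption).
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort by distance and keep the labels of the k closest samples.
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    # Simple majority; ties between classes are broken at random.
    top = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(tied)


# Toy data (hypothetical): two well-separated groups of 2-D points.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))
```

Because KNN is memory-based, there is no training step: all the work happens at prediction time, which is one reason reducing the number of features beforehand matters.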
The increasing dimensionality of biological data is a major problem for simple, conventional analysis methods. It is important to use traditional approaches for learning complex strategies on several layers moved by morphological processes interested in processing. Several complexities are involved in most typical procedures used to deal with high-dimensional data such as RNA-Seq data. Combining different dimensionality reduction methods can exploit their individual advantages, with the gene subset obtained from one procedure serving as the input to the other. In general, feature extraction techniques complement feature selection well, either by using feature selection to pick an initial subset of genes or by taking advantage of redundant gene elimination. Once an initial feature subset has been extracted, combining various feature extraction methods can be useful. An effective dimension reduction method for classifying malaria vector data is proposed in this study.
RNA-Seq has tremendous potential for finding, defining, and tracing cell lines, and dimensionality reduction helps reveal structure in the data. The data nevertheless remain difficult to analyze, and current algorithms need further development to reveal suitable characteristics. A fusion approach proves to be strong but requires effective procedures to model.
The proposed classification technique consists of three phases, namely:

Selection of features
Extraction of features
Classification

Figure 1 illustrates the proposed hybrid system for classifying the malaria gene expression dataset. The framework consists of three subsystems: a subsystem for feature selection, a class-based subsystem for feature extraction, and a subsystem for classification.
The feature selection subsystem uses an optimized GA, following Algorithm 1 below, to pick an optimum subset by assessing chromosome fitness. The feature extraction subsystem uses PCA and ICA because of their efficiency and invariance in projecting the data. The quality of the results is assessed by classification with KNN.
The significance of genetic algorithm optimization lies in its evolutionary processing of the features: it maintains numerous search points that simultaneously and independently explore the search space to produce a good result. In this study, genetic algorithm feature selection is optimized to minimize the number of features while retaining the discriminant ones. Feature extraction then transforms the reduced data into latent components. The aim is to reduce redundancy and to benefit from both dimensionality reduction methods in the classification of malaria.
Algorithm 1: Phase 1: formulate the a and b parameters then establish the initial population arbitrarily.
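Since only the first phase of Algorithm 1 is given here, the following is a generic sketch of GA-based feature selection rather than the optimized GA used in this study. Tournament selection, one-point crossover, bit-flip mutation, elitism, all parameter values, and the toy fitness function are assumptions chosen for illustration; the parameters a and b of Algorithm 1 are not modelled.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Return the best 0/1 feature mask found by a simple generational GA."""
    rng = random.Random(seed)
    # Initial population: random bit masks, one bit per feature.
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament(population):
        # Pick two distinct chromosomes at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


# Toy fitness (hypothetical): features 0, 3 and 7 are "informative"; every
# selected feature costs 0.1, so smaller masks are preferred.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


best_mask = ga_select(10, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a wrapper setting such as the one described in this paper, the fitness of a chromosome would more plausibly be the cross-validated accuracy of a classifier trained on the selected features, which is also what makes GA-based selection computationally expensive.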

Results and Discussion
This study proposes a classification of malaria vector data using a public dataset with 2457 samples and 7 features (Arowolo et al. 2020), implemented in MATLAB. The dataset was processed with an optimized genetic algorithm to pick the pertinent features; using a threshold of 0.5, a subset of 708 significant features was selected. Classifier performance was compared with the state of the art for evaluation.
The 708 features selected by the optimized genetic algorithm were first passed into the PCA algorithm, which extracted 10 latent variables in 1.4623 seconds. The extracted features were classified using the KNN classification algorithm with 10-fold cross-validation. The KNN confusion matrix was then evaluated using performance metrics.
The 708 selected features were also passed into the ICA algorithm, which extracted 25 latent variables in 0.42794 seconds. The latent features were classified with KNN using 10-fold cross-validation, and the confusion matrix was evaluated.
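The evaluation step, splitting the data into 10 folds and deriving metrics from a confusion matrix, can be sketched as follows. The study does not list its metrics at this point, so the choice of accuracy, sensitivity, specificity, and precision, the binary-class setting, and all function names are assumptions made for illustration.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive common performance metrics from the confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Each fold serves once as the test set; the other folds form the training set.
folds = kfold_indices(25, k=10)
# Hypothetical labels and predictions, for illustration only.
counts = confusion_counts(["p", "p", "n", "n", "p"], ["p", "n", "n", "p", "p"], "p")
print(metrics(*counts))
```

In 10-fold cross-validation the confusion-matrix counts from the ten test folds are typically summed before the metrics are computed, so every sample contributes exactly one prediction.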
Dimensionality reduction of the malaria vector data was carried out using the GA-O + PCA + KNN and GA-O + ICA + KNN algorithms, and the performance evaluations of the experiments are tabulated below.
This study has several significant implications for the analysis of gene expression data. Its potential application is to give relevant insight into the genetic and technical considerations that can clarify the revealed structures and identify genes suitable for the prediction, analysis, and detection of malaria infection and transmission, and for drug design.

The GA-O with PCA with KNN Results
The GA-O with ICA with KNN Results
As stated in Table 2, this study attained reliable performance compared with other useful algorithms.
This study proposed a hybrid dimension reduction approach using an optimized genetic algorithm. The PCA and ICA algorithms were applied to the selected features, and the KNN algorithm with 10-fold cross-validation was used for classification. The results were enhanced, as shown in Table 2: compared to the state of the art, the accuracies showed an improvement.
Numerous investigators have studied the underlying classification problems using machine learning methods in order to provide a dependable detection and prediction method for malaria infection and transmission. The results achieved here could be used by clinicians to compile curated diagnostic datasets for training classifiers, and to significantly increase dataset sizes, which addresses the overfitting difficulties associated with training. Profiling thousands of genes offers deep insight into malaria classification problems and yields ample data for drug discovery, prediction, and the diagnosis and treatment of malaria, as well as for understanding the roles of genes and the interactions between genes under normal and abnormal conditions. This study improved classification performance and showed less dependence on the training set.

Availability of data and materials
The datasets for this study are available on request to the corresponding author.

Competing interests
The authors declare that they have no competing interests.