Improved cost-sensitive representation of data for solving the imbalanced big data classification problem

Dimension reduction is a preprocessing step in machine learning for eliminating undesirable features and increasing learning accuracy. In order to reduce the redundant features, there are data representation methods, each of which has its own advantages. On the other hand, big data with imbalanced classes is one of the most important issues in pattern recognition and machine learning. In this paper, a method is proposed in the form of a cost-sensitive optimization problem which implements the process of selecting and extracting the features simultaneously. The feature extraction phase is based on reducing error and maintaining geometric relationships between data by solving a manifold learning optimization problem. In the feature selection phase, the cost-sensitive optimization problem is adopted based on minimizing the upper limit of the generalization error. Finally, the optimization problem which is constituted from the above two problems is solved by adding a cost-sensitive term to create a balance between classes without manipulating the data. To evaluate the results of the feature reduction, the multi-class linear SVM classifier is used on the reduced data. The proposed method is compared with some other approaches on 21 datasets from the UCI learning repository, microarrays and high-dimensional datasets, as well as imbalanced datasets from the KEEL repository. The results indicate the significant efficiency of the proposed method compared to some similar approaches.

The large size of modern data sets makes feature selection one of the most fundamental topics in machine learning. As the number of features increases, the performance of learning algorithms initially improves, but beyond a certain point additional features no longer help and can even degrade performance. Moreover, as the number of features grows, more data samples are needed, which increases the time and space complexity of the problem [3]. The features in the data can be divided into three general categories. Irrelevant features: features that carry little information and are unrelated to the data mining goals, so they usually reduce the performance of data mining algorithms. Redundant features: features that are related to other features, such as features that can be computed from other features. Relevant features: features that have a great impact on classification accuracy and are the main target of feature selection methods.
Dimensionality reduction techniques can be classified into two main groups [4][5][6]: feature selection and feature extraction. In feature selection, the best features are selected based on their contribution to the final performance of the approach. In this type of dimensionality reduction some information can be lost. On the other hand, feature extraction or mapping approaches aim to find another representation of the data so that the new representation (i.e. the new features) improves some specific criterion, such as maintaining the information [e.g. Principal Component Analysis (PCA)], discriminating the classes [e.g. Linear Discriminant Analysis (LDA)] or maintaining the local or global structure of the data (e.g. manifold learning methods). In feature extraction, the dimension can be decreased without losing much of the initial feature information [4][7][8][9].
High dimensionality and imbalance are common problems in microarray data. Imbalance is one of the major challenges in classification, and it becomes more acute when the data set has a large number of features. Traditional classifiers usually favor the majority class, so parameter setting and attribute selection tend to yield attributes that describe the majority class better, which degrades performance on the minority class. The solutions to the imbalance problem can be divided into two general categories: data-based and model-based. In data-based methods, an attempt is made to strike the expected balance by reducing the majority class data or generating data from the minority class distribution. In model-based methods, an attempt is made to build a model that is sensitive to the cost of incorrectly classifying minority class data. These are called cost-sensitive methods, or model-based methods for short.
In this manuscript, we look for a space in which data that are similar in nature are inherently close. We do not reduce the number of samples, but try to find a new space in which the data can be better represented: a space in which the data are better separated and in which the problem of data imbalance is taken into account. The general purpose of this paper is to provide a model that accounts for data imbalance by solving an optimization problem that selects and extracts features simultaneously, improving accuracy and precision.
The organization of the article is as follows. Related works are studied in the "Related works" section. The proposed method is presented in the "The proposed method" section. The "Experiments" section evaluates the proposed method and compares its performance with other methods. Finally, the "Conclusion" section concludes the paper.

Related works
Feature reduction as a preprocessing step can remove irrelevant data, noise and additional features. Feature reduction is based on two main methods of feature selection and feature extraction. In this section, we discuss some important related works with regard to this classification [4].

Feature extraction
Feature extraction methods extract new features from the original dataset, and are very useful when we want to reduce the amount of resources required for processing without losing the relevant feature information [4]. Instead of deleting a few features, the input data space is changed. When data is mapped from the input space to a space of smaller dimension by a transformation, the nature of the basic features is changed.
Principal Component Analysis (PCA) is a simple feature extraction and embedding method. PCA is widely used as a baseline feature mapping approach, but it performs the task in an unsupervised manner. In contrast, our method is a supervised method that uses labels in the feature reduction process. The authors of [13] present a dimension reduction algorithm for information spaces. This algorithm reduces the dimension of the space while maintaining a simplex structure, and can be used as a black-box method to speed up algorithms that operate in information divergence spaces. It shows how to embed information distances such as the χ² and Jensen-Shannon divergences efficiently in low-dimensional spaces while preserving all pairwise distances. Other than the definition of the feature space, the main difference between this method and the proposed approach is that we consider the misclassification cost in feature extraction, which aims to solve the imbalanced data problem.
In [10] a method is presented that is effective in dealing with data with multiple views. In this type of data, there are several different views for each data at the same time. It searches for a space called x with a lower dimension than the input space, which has information from all views and can also be returned from that space to all views.

Feature selection
The main purpose of feature selection is to select the appropriate number of attributes to perform the classification tasks [12].
Conventional feature selection methods perform feature selection operations over the entire sample space. The filter-based local feature selection algorithm [14] is proposed based on the artificial immune system, which determines a subset of the relevant local feature for each area adjacent to the sample space. This algorithm introduces a selection algorithm to optimize the search space for attribute subsets and adopts the idea of local clustering as an evaluation criterion that maximizes between class distances and minimizes within class distance.
Ref. [9] tries to select good features by optimizing multivariate criteria based on sparse representation. To measure the complexity of classification under different feature spaces, a feature evaluation criterion called counting region covering (CRC) is first proposed. Then, by simultaneously optimizing the classification error rate and the separation boundary complexity, a feature selection framework is provided. The approach proposed in [15] performs feature selection based on the minimization of the generalization error bound. It uses a linear model inspired by the support vector machine, which performs classification on the reduced-dimensional data. Therefore, feature selection and classification are done simultaneously.
In [10], multiple kernel learning feature selection (MKL-FS) uses kernel methods to search for the complex properties of each feature. However, the available kernels are usually limited to positive constraints, although certain negative kernels can often perform better in real applications. Due to the non-convexity of indefinite kernels, most methods are usually not practical and the relevant research is relatively limited. A two-step algorithm for optimizing the indefinite kernel support vector machine (IKSVM) and the kernel combination coefficients is also proposed. In [11], an approach is proposed to classify web text documents that uses the benefits of a hierarchical structure to remove words that are not related to the Word-Net lexical categories from the attribute vectors.
In [16], a criterion for evaluating the selected features based on the quality of the features is presented. The main idea is to use a sparse representation to test each feature independently. A feature-based classification method is used to evaluate the proposed method. The authors of [17] propose an effective distance-based feature selection (ED-Relief) method, which uses a complex distance measurement to deal with the simultaneous optimization of within-class and between-class distances. In [18], an algorithm for feature selection based on associative rules and an integrated classification algorithm based on random sampling are proposed.
Authors of [19], introduced a new sample augmentation method called MAHAKIL. They believed that selecting samples very close to their neighbors would result in little variation in the samples produced in the minority class. Therefore, they used the characteristics of the two samples as parent samples to produce a new sample. A new feature selection method based on interaction information (II) is proposed in [32] to provide high-level interaction analysis and improve the search method in the feature space.
Ref. [20] proposes a new hybrid feature selection algorithm called IGIS, which selects features based on interaction information. This algorithm uses the JMI criterion to find candidate attributes and adds one attribute at a time to the currently selected subset. After an attribute is added to the selected set, the attribute list is recalculated.
Interacting features are those that appear to be irrelevant or weakly relevant to the class individually, but that may be highly correlated with the class when combined with other features. Discovering feature interaction is a challenging task in feature selection. In [21], a novel feature selection algorithm considering feature interaction is proposed. Mutual information-based feature selection algorithms, although performing well in many cases, currently suffer from two drawbacks: (1) ignoring feature interaction and (2) over-estimation of some features. To overcome these shortcomings, [22] proposes a new filter feature selection algorithm based on WJMI-weighted mutual information, which prevents over-estimation of some features by considering feature interaction.

Imbalance learning
One of the main and simplest methods of data reduction for imbalanced datasets is presented in [23], which randomly deletes some of the majority class data. The Condensed Nearest Neighbor Rule (CoNN) method [24] uses a similar approach to remove samples that are farther apart than the majority of the data. Unlike [24], the method in [25] eliminates noisy and near-boundary samples. In [26], a clustering method is used to maintain the distribution of minority and majority class data after deleting the data. An approach based on evolutionary algorithms is presented in [27], in which the selection of samples for deletion is treated as a search problem.
One of the most popular data generation methods for the minority class is SMOTE [28]. In this method, samples for the minority class are generated through the interpolation of neighboring data. Several approaches have been proposed to address the weaknesses of SMOTE. In [29], a SMOTE-inspired method called Borderline-SMOTE is used to reduce the tendency to overlap between majority and minority classes. LN-SMOTE [30] and safe-level SMOTE [31] are further developments of this line of research [32].
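As a minimal sketch of the random deletion idea attributed to [23] above, the following assumes that majority classes are reduced to the size of the smallest class; the function name `random_undersample` and the toy data are hypothetical and not taken from the cited work.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class keeps as many samples
    as the smallest class (assumed balancing target)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y

# Toy data: five majority (-1) samples and one minority (+1) sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [-1, -1, -1, -1, -1, +1]
X_bal, y_bal = random_undersample(X, y)
```

The drawback that motivates the other methods in this paragraph is visible here: which majority samples survive is left to chance.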
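The interpolation step of SMOTE described above can be sketched as follows. This is a simplified illustration that generates a single synthetic point, not the full algorithm of [28]; the function name `smote_sample`, the neighbor count `k` and the toy points are assumptions.

```python
import random

def smote_sample(minority, k=3, seed=0):
    """Create one synthetic minority point on the segment between a
    randomly chosen minority point and one of its k nearest minority
    neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining minority points by squared distance to base.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = [a + gap * (b - a) for a, b in zip(base, neighbour)]
    return base, neighbour, synthetic

minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]]
base, neighbour, synthetic = smote_sample(minority, k=2, seed=0)
```

Because the synthetic point always lies between two existing minority points, no new region of the feature space is explored, which is one of the weaknesses the later variants address.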
In [33] a method called cost-sensitive high-margin support vector machines is proposed. In this method, the goal is to increase the margin of the minority class and reduce the margin of the majority class. This operation is performed by manipulating the cost parameter C in the support vector machine and dividing it into C + for positive class data and C − for negative class data. If the positive class is in the minority, the C + parameter is selected as a larger number and vice versa. In [34], a cost-sensitive issue is included in the extreme learning machine (ELM) classification with a similar approach.
In [35], a new risk forecasting method is proposed as imbalanced classification and solves the feature selection problem. In particular, a high-margin loss function is presented in which the weight of the samples is involved. Accordingly, an optimization objective function is designed with a soft tuning of one to improve performance, which is solved in an iterative context. SMOTE-based class-specific learning is proposed in [36] and uses minority sampling in the kernel space to solve the class imbalance problem. Motivated by weighted kernel-based SMOTE (WKSMOTE), this method proposes a SMOTE class-specific extreme learning machine (SMOTECSELM), a class-specific extreme learning machine (CS-ELM), which takes advantage of minority and class-specific sampling.

Other works
In [37], a low-rank regression model is proposed for feature extraction and feature selection from images without vectorization. To effectively solve the objective function, an optimization-based alternative to Lagrangian coefficients has been developed. In [38], features are extracted and selected in a cascading manner, and the Pigeon Inspired Optimization (PIO) method is used to select the features. In [39], a hybrid approach simultaneously reduces the majority data with rough set theory and increases the minority data using the SMOTE method. These methods and methods similar to [39] and [40] are among the data-based hybrid methods. In [41], a feature selection method has been designed that emphasizes two issues: the class imbalance problem and the large size of the data. Also, [39] proposes a cost-sensitive approach in the context of a concave optimization problem and solves it through a Newton-like process. To prevent the explosion of the data space dimensionality while maintaining the statistical coherence of the part of the data set selected for training, [42] has developed an approach for selecting training data based on Pareto analysis performed on classification descriptors. It also provides empirical evidence that this approach retains its validity, even when compared to traditional space-reduction methods and classical machine learning algorithms.
Ref. [43] proposes a new framework that makes it possible to identify anomalous data points in large volumes of data with high dimensional problems. Authors of [44] proposed methods for reducing the number of variables that include more information and for reliable classification, as well as several methods for reducing dimensions and classification.
The works reviewed generally use only one operation, either feature selection or feature extraction, to reduce the dimensionality. Hybrid methods are also available, but they mainly aim at combining filter, embedded and wrapper approaches, and do not combine the two main categories of selection and extraction at the same time. In this paper, the proposed method is introduced in the form of an optimization problem that solves the feature selection and extraction problems simultaneously, so that the benefits of both approaches can be used. The proposed optimization problem also includes a cost-sensitive function that is designed to make the model robust to data with imbalanced labels, creating a balance without manipulating the data.
The term cost-sensitive refers to making the feature reduction process robust to imbalanced data. This robustness is embedded within the proposed optimization problem. Therefore, the proposed method solves the imbalance problem at the level of the feature space.

The proposed method
Assume that $X \in \mathbb{R}^{d \times n}$ is a data matrix containing $n$ data points of dimension $d$, where $x_i$ is the $i$-th data point. The labels are represented by the vector $Y = [y_1, \ldots, y_n]$, where $y_i \in \{-1, +1\}$. Table 1 summarizes the frequently used notations. The matrix $Z \in \mathbb{R}^{m \times n}$ is a reduced latent representation of $X$, where $m \ll d$. In other words, $Z$ is the result of the feature extraction operation on $X$, which can be generated by the following mapping:

$$X = QZ + \varepsilon \qquad (1)$$

where $Q \in \mathbb{R}^{d \times m}$ is the mapping matrix and $\varepsilon \in \mathbb{R}^{d \times n}$ is the reconstruction error. A simple way to reduce the reconstruction error is the soft minimization of $\varepsilon$, which gives Eq. (2). Suppose that feature selection is done on the input space through a diagonal matrix $\sigma$ such that $\mathrm{diag}(\sigma)$ consists of 0 and 1 entries, where $\sigma_j = 1$ denotes that the $j$-th feature is selected from the dataset and $\sigma_j = 0$ that it is not. Incorporating feature selection by $\sigma$ into Eq. (2) gives Eq. (3). Adding a regularization term to that problem gives Eq. (4), where the non-negative parameter $C$ is the penalty for the regularization term. In addition, to control the matrix $Q$ and prevent large values, the constraint $Q^T Q = I$ is added, which gives Eq. (5). Finally, to increase the discriminability of the classes and preserve the original geometric structure in the reduced space $Z$, another term, named local alignment and denoted $LA(Z)$, is added to the objective function, which gives Eq. (6). $LA(Z)$ is discussed next.
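To make the roles of the mapping matrix, the reduced representation and the selection mask concrete, the sketch below computes a masked reconstruction error on toy data. It rests on assumptions that are not confirmed by the text: the mask is applied to the rows (features) of X, the error is measured by the squared Frobenius norm, and Z is taken as Q^T diag(σ) X, which minimizes that error when Q^T Q = I and no regularization or alignment term is present. The helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def masked_reconstruction_error(X, Q, sigma):
    """X: d x n data (rows are features), Q: d x m with orthonormal
    columns, sigma: list of d zeros/ones selecting features."""
    # Apply the 0/1 feature mask to the rows of X, i.e. diag(sigma) X.
    Xs = [[s * v for v in row] for s, row in zip(sigma, X)]
    Z = matmul(transpose(Q), Xs)   # m x n reduced representation
    R = matmul(Q, Z)               # d x n reconstruction of the masked data
    err = sum((a - b) ** 2 for ra, rb in zip(Xs, R) for a, b in zip(ra, rb))
    return Z, err

X = [[1, 2], [3, 4], [5, 6]]        # d = 3 features, n = 2 samples
Q = [[1, 0], [0, 1], [0, 0]]        # d x m with m = 2, Q^T Q = I
Z_sel, err_sel = masked_reconstruction_error(X, Q, [1, 1, 0])
Z_all, err_all = masked_reconstruction_error(X, Q, [1, 1, 1])
```

In this toy case, dropping the third feature lets the two-dimensional representation reconstruct the masked data exactly, while keeping all three features leaves a residual, which illustrates how selection and extraction interact.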
Eq. (7) defines $LA(Z)$. Minimizing Eq. (7) means that the distance of each reduced point $z_i$ from its $k_1$ nearest intra-class neighbors is decreased, while its distance from its $k_2$ neighbors from the opposite class is increased. In Eq. (7), $z_{ij}$ denotes the $j$-th nearest neighbor of the $i$-th data point with the same label, and $z_{ip}$ is the $p$-th nearest neighbor of the $i$-th data point from the opposite class. The parameter $\beta$ sets the importance of the second term in the equation. Note that LA makes the algorithm supervised, because the class labels must be known to compute intra-class and inter-class neighbors.
The feature selection problem is similar to the support vector machine (SVM) optimization [10]; it is formulated as problem (8), which performs feature selection and classification simultaneously.
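One form consistent with the verbal description of the local alignment term is $LA(Z) = \sum_i \big( \sum_{j=1}^{k_1} \|z_i - z_{ij}\|^2 - \beta \sum_{p=1}^{k_2} \|z_i - z_{ip}\|^2 \big)$. This form is an assumption reconstructed from the text, and the sketch below evaluates it on toy data; the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_alignment(Z, y, k1, k2, beta):
    """Z: list of n reduced points, y: labels in {-1, +1}.
    Sums, for every point, the squared distances to its k1 nearest
    same-class neighbours minus beta times the squared distances to
    its k2 nearest opposite-class neighbours (assumed form)."""
    total = 0.0
    for i, zi in enumerate(Z):
        same = sorted(sq_dist(zi, zj) for j, zj in enumerate(Z)
                      if j != i and y[j] == y[i])
        other = sorted(sq_dist(zi, zp) for p, zp in enumerate(Z)
                       if y[p] != y[i])
        total += sum(same[:k1]) - beta * sum(other[:k2])
    return total

Z = [[0.0], [1.0], [3.0], [4.0]]
y = [1, 1, -1, -1]
la_value = local_alignment(Z, y, k1=1, k2=1, beta=0.5)
```

A smaller value indicates tighter classes that lie farther apart, so moving the two classes away from each other lowers the term.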
In (8), $\xi$ is the vector of slack variables and $\mathbf{1}$ is a vector of ones of the same length as $\xi$, so the product $\mathbf{1}^T \xi$ is the sum of the $\xi$ values. Also, $\omega$ denotes the coefficients of the separating hyperplane, $b$ is the bias, and $z_{ij}$ is the $j$-th feature of the $i$-th data point. The quantities $\nu_j^{-}$ and $\nu_j^{+}$ are the empirical means of the second-order moment of the $j$-th feature in the negative and positive classes, respectively:

$$\nu_j^{+} = \frac{1}{n_{+}} \sum_{i:\, y_i = +1} z_{ij}^{2}, \qquad \nu_j^{-} = \frac{1}{n_{-}} \sum_{i:\, y_i = -1} z_{ij}^{2}$$

where $n_{-}$ and $n_{+}$ are the numbers of negative and positive class data points, respectively.
One of the constraints in (8) prevents excessive growth of $\sum_{j=1}^{m} \nu_j^{+} \sigma_j^{2}$ and $\sum_{j=1}^{m} \nu_j^{-} \sigma_j^{2}$, which have the upper bounds $R^{+}$ and $R^{-}$, respectively. Choosing low values for $R^{+}$ and $R^{-}$ makes the optimal solution for $\sigma$ include more zero values and consequently results in a higher feature reduction rate.
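The per-class second-order moments and the two bound constraints can be checked numerically as in the sketch below. It assumes samples are stored as rows, labels are in {-1, +1}, and both classes are non-empty; the function names and toy values are illustrative only.

```python
def second_moments(Z, y):
    """Z: list of n samples, each a list of m features.
    Returns the per-feature mean of squared values for each class."""
    m = len(Z[0])
    pos = [z for z, lab in zip(Z, y) if lab == 1]
    neg = [z for z, lab in zip(Z, y) if lab == -1]
    nu_pos = [sum(z[j] ** 2 for z in pos) / len(pos) for j in range(m)]
    nu_neg = [sum(z[j] ** 2 for z in neg) / len(neg) for j in range(m)]
    return nu_pos, nu_neg

def within_bounds(nu_pos, nu_neg, sigma, r_pos, r_neg):
    """Check sum_j nu_j^+ sigma_j^2 <= R^+ and the analogous
    constraint for the negative class."""
    lhs_pos = sum(v * s ** 2 for v, s in zip(nu_pos, sigma))
    lhs_neg = sum(v * s ** 2 for v, s in zip(nu_neg, sigma))
    return lhs_pos <= r_pos and lhs_neg <= r_neg

Z = [[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]]
y = [1, 1, -1]
nu_pos, nu_neg = second_moments(Z, y)
ok_loose = within_bounds(nu_pos, nu_neg, [1, 0], r_pos=20, r_neg=50)
ok_tight = within_bounds(nu_pos, nu_neg, [1, 0], r_pos=4, r_neg=50)
```

With the tighter bound the current mask violates the constraint, so more entries of the mask would have to be zero, which matches the remark that low bounds lead to a higher feature reduction rate.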
Another constraint in problem (8), $\omega^T \omega \le 1$, bounds the $\ell_2$-norm of $\omega$ and plays the role of regularization. The final optimization problem is obtained by combining relations (6) and (8). In that combined problem, the input data is assumed to be balanced: if there are two classes of data, approximately equal amounts of data are available from each of them. Since this assumption does not always hold in real-world datasets, a cost-sensitive term is added to the combined problem, as in [21]. This gives problem (11), which keeps the constraints of Eq. (8). In problem (11), the slack values of the positive and negative class data are weighted by different coefficients. Suppose the positive class is the minority; then, to prevent the model from neglecting that class, the coefficient $C^{+}$ can be chosen larger than $C^{-}$. Because slack on the positive samples is then penalized more heavily, the model is discouraged from misclassifying them.
The cost-sensitive optimization problem now becomes Eq. (12). Solving Eq. (12) performs feature weighting simultaneously with feature extraction. The goal is to find $z_i$, the result of feature extraction on the selected features $x_i \sigma$. Moreover, because the problem is cost-sensitive with respect to misclassification of the positive and negative classes, the feature reduction process is steered toward output features that suit not only the majority class but also the minority class.
The proposed problem increases class separation in the reduced space, in particular through the LA term. Feature extraction, as the solution of a manifold learning optimization problem, is based on reducing the error and maintaining the geometric relationships between data. For feature selection, an optimization problem based on minimizing the upper bound of the generalization error is adopted. Finally, the optimization problem that combines these two problems is solved with an added cost-sensitive term, which creates a balance between classes without manipulating the imbalanced data. The flowchart of the proposed approach is illustrated in Fig. 1.
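The effect of using different slack coefficients for the two classes can be illustrated with the sketch below. It assumes hinge-type slack, $\xi_i = \max(0, 1 - y_i(\omega^T z_i + b))$, as in a standard soft-margin SVM; the function name and the toy numbers are hypothetical, and this is only the slack part of the objective, not the full problem.

```python
def cost_sensitive_slack(w, b, Z, y, c_pos, c_neg):
    """Return C+ * (sum of slack over positive samples)
    + C- * (sum of slack over negative samples)."""
    penalty = 0.0
    for zi, yi in zip(Z, y):
        margin = yi * (sum(wj * zj for wj, zj in zip(w, zi)) + b)
        slack = max(0.0, 1.0 - margin)  # assumed hinge-type slack
        penalty += (c_pos if yi == 1 else c_neg) * slack
    return penalty

w, b = [1.0], 0.0
Z = [[0.5], [-2.0], [-0.5]]   # one minority (+1) and two majority (-1) points
y = [1, -1, -1]
equal_cost = cost_sensitive_slack(w, b, Z, y, c_pos=1.0, c_neg=1.0)
minority_weighted = cost_sensitive_slack(w, b, Z, y, c_pos=10.0, c_neg=1.0)
```

Here the single positive sample and one negative sample violate the margin by the same amount, yet with the larger positive coefficient the positive violation dominates the penalty, so an optimizer would prioritize correcting it.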

The pseudo code
The steps of the optimization algorithm are denoted in Alg. 1.
Algorithm 1 An iterative solution to optimize the proposed problem
Input: data matrix $X = \{x_1, \ldots, x_n\}$ where $x_i \in \mathbb{R}^{d}$, $i = 1, \ldots, n$, and labels $y_i \in \{-1, +1\}$
Output: dimensionality-reduced data $z_i$, $\forall i = 1, \ldots, n$, $\omega$ and $b$.

7: end for
As may be inferred from the above algorithm, the loop in line 2 is iterated max_iteration times. In the loop, $z_i$ is computed first, which takes constant time with respect to the number of samples and features, so calculating the whole $Z$ has a time complexity of $O(n)$. Calculating $Q$ is the most complex stage of the approach: the first parenthesis in the expression for $Q$ is calculated in $O(nm)$ and the second has the same complexity, so finding $Q$ costs $O(2nm)$. Solving Eq. (12) involves six summations over $n$; assuming approximately constant-time operations in each summation, the last stage costs $O(cn)$. Repeating these stages max_iteration times, the total time complexity of the algorithm is O(max_iteration * (n + 2nm + cn)), in which $n$ is the number of samples and $m$ is the number of final extracted features. The time complexity is therefore linear in $n$ and in $m$, which is modest compared with similar approaches, and a parallel implementation would further decrease the run time of the algorithm.

Table 4 The best values of the parameters obtained by the PSO evolutionary algorithm
Hyper-parameter | Value | Definition
R+ | 20 | Choosing a low value for R+ makes the optimal solution for σ include more zero values, which results in a higher feature reduction rate
R− | 50 | Choosing a low value for R− makes the optimal solution for σ include more zero values, which results in a higher feature reduction rate
C | 0.1 | Non-negative penalty for the regularization term

Experiments
In this section, we simulate the proposed method and evaluate its performance. We first describe the implementation environment, the evaluated data sets and the parameter settings, then introduce the evaluation criteria, and finally compare the performance of the proposed method with other methods in terms of these criteria.

Experimental setup
In order to evaluate the effectiveness of the proposed method, the data collections from the UCI machine learning repository, high-dimensional microarrays datasets as well as imbalanced datasets from the KEEL repository have been used.
Attempts are made to use data that is large in size so that a number of appropriate features are reached when the dimension reduction is performed. Also, the use of data with more imbalance in their labels will lead to a better evaluation of the performance of the proposed method in the face of imbalanced data. Specifications of the test classification datasets are mentioned in Table 2.
The size of the datasets in this article varies from 62 to 10,000, the number of their features varies from 4 to 7129, and the number of classes from 2 to 101. The imbalance ratio in Table 2 is calculated by the following equation [45]:

$$\text{Imbalance ratio} = \frac{\text{size of majority class}}{\text{size of minority class}} \qquad (13)$$

In Table 3, the number of selected features for each data set is specified. As a heuristic, the integer value of half the number of attributes in each data set is used as the number of selected features.
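Eq. (13) and the half-of-the-attributes heuristic can be written down directly, as in the short sketch below. The function names are hypothetical, and rounding down for odd feature counts is an assumption about what "integer value" means.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def default_num_selected(num_features):
    """Half the number of attributes, rounded down (assumed rounding)."""
    return num_features // 2

ratio = imbalance_ratio([1] * 10 + [-1] * 90)
selected = default_num_selected(7129)
```

A ratio of 1 corresponds to a perfectly balanced data set, and larger values indicate stronger imbalance.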

Evaluation method
To evaluate the results of the approach, the multi-class linear SVM classifier is used on the reduced data. K-fold cross validation method is used to perform the experiments. MATLAB libsvm library is used for SVM implementation and evaluation and linear kernel with C = 1 is used as the SVM settings. The one-vs-one model is used to classify the proposed method for multiclass data.
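The K-fold protocol mentioned above can be sketched as an index splitter. The text does not state the value of K or whether the folds are stratified; the version below assumes stratification (each class is spread evenly over the folds), which is a common choice for imbalanced data, and the function name is hypothetical.

```python
def stratified_kfold(labels, k):
    """Yield (train_indices, test_indices) for k folds, dealing the
    members of each class round-robin so folds keep the class mix."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

labels = [1] * 4 + [-1] * 8
splits = list(stratified_kfold(labels, k=4))
```

Each split would then be used to fit the feature reduction and the linear SVM on the training indices and to score them on the test indices.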

Parameter settings
The best values of the parameters are obtained by the Particle Swarm Optimization (PSO) algorithm on the validation data set. In the experiments, β is set to 0.0001, C to 0.1 and the number of neighbors to 3, and the upper bounds R+ and R− are set to 20 and 50, respectively. The values of the hyper-parameters used in the proposed method are presented in Table 4.

Experimental design
MATLAB 2018 software has been used to implement the proposed algorithm. The tests were also evaluated on a PC with Intel Core i3 processor, 4 gigabytes of RAM, and Windows 8.1 operating system.

Evaluation criteria
To compare the performance of the proposed method, the F-score and accuracy criteria are used:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{F-score} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives. Two other criteria are used in the experiments, precision (PRC) and recall (RCL):

$$PRC = \frac{TP}{TP + FP}, \qquad RCL = \frac{TP}{TP + FN}$$

Due to the imbalance in the data sets, one of the RCL and PRC criteria may be too high or too low, so their geometric mean (g-mean), $\sqrt{PRC \cdot RCL}$, is considered in our experiments.

Figs. 2 to 7 The performance of the proposed cost-sensitive method, S-MVML-LA [15], GMEB [10] and IGIS [20] versus the percentage of selected features on the cancer (Fig. 2), dermatology-6 (Fig. 3), Ionosphere (Fig. 4), kddcup_land_vs_satan (Fig. 5), musk (Fig. 6) and WDBC (Fig. 7) datasets. a Accuracy, b F-score
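As a sketch of the evaluation criteria used in this section (accuracy, precision, recall, F-score and the geometric mean of precision and recall), the following computes them from confusion-matrix counts using the standard definitions. The function name and the example counts are illustrative and are not results from the paper.

```python
import math

def evaluation_criteria(tp, tn, fp, fn):
    """Compute the criteria from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0
    g_mean = math.sqrt(precision * recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score, "g_mean": g_mean}

# Made-up counts for an imbalanced test set: 18 positives, 82 negatives.
scores = evaluation_criteria(tp=8, tn=80, fp=2, fn=10)
```

In this example the accuracy is high while recall is below one half, which shows why accuracy alone is not a sufficient criterion on imbalanced data.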
As shown in Table 5, the proposed method performs better than the other methods in most cases in terms of accuracy. The numbers in parentheses denote the rank of each approach on each dataset, and the final row gives the average rank of each method. As seen in the table, the proposed method has the best average rank among the compared methods. After the proposed method, the GMEB method obtains the next best ranking, and the IGIS method ranks lowest. Not available (NA) values in the following tables indicate that the corresponding datasets are not evaluated in the reference articles.
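The per-dataset ranks and their average can be computed as in the sketch below. It assumes that rank 1 is the best score on a dataset, that ties are broken arbitrarily, and that a method with an NA entry is simply skipped for that dataset; the method names and scores are made up for illustration.

```python
def average_ranks(scores):
    """scores: {dataset: {method: score}} with higher scores better.
    Returns {method: mean rank over the datasets where it was evaluated}."""
    totals, counts = {}, {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
            counts[method] = counts.get(method, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# Hypothetical accuracies; method C has no result (NA) on dataset d2.
toy_scores = {
    "d1": {"A": 0.9, "B": 0.8, "C": 0.7},
    "d2": {"A": 0.6, "B": 0.7},
}
ranks = average_ranks(toy_scores)
```

Under this convention a lower average rank is better, and methods evaluated on fewer datasets are averaged over fewer entries.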
In Table 6, the proposed cost-sensitive method has the highest average ranking. The IWFS method average rank is 0.92 worse than the proposed method and is ranked second in total.
Again in Table 7, the proposed method has a better result than the other methods and the best average rank. On the three datasets Kddcup-buffer-overflow-vs-back, Kddcup-land-vs-satan and Kddcup-rootkit-imap-vs-back, the proposed method performs similarly to GMEB.
As shown in Table 8, the performance of the proposed method is better in most cases. Comparing the proposed cost-sensitive method with the other methods, the GMEB approach has superior performance on the iris, Kddcup-rootkit-imap-vs-back and Dermatology-6 datasets, while the S-MVML-LA approach cannot compete with the two other approaches. However, on several data sets the proposed method performs well, and it has the best average rank.
In Table 9, the methods are evaluated with the g-mean criterion. This criterion is one of the most accurate evaluation criteria for imbalanced data sets. As seen in the results, the cost-sensitive method performed better on most datasets and the proposed approach has again the highest average rank.
Table 10 shows the average execution time of our method compared to the other approaches. As could be predicted, the proposed approach has a relatively high execution cost on most of the datasets compared to the other evaluated approaches. However, it is still faster than the IGIS method on the Movement_libras, Colon, and Iris datasets. Also, on Caltch101 (with 784 original features), Mnist (with 5469 original features) and DLBCL77 (with 784 original features), which are among the highest-dimensional evaluation datasets, our method has the lowest execution time of all methods. These observations suggest that on high-dimensional data the other compared approaches degrade and our algorithm is more efficient. On lower-dimensional data the proposed approach has a higher running time, but the difference is negligible when the whole execution time is on the scale of 1 or 2 s. As shown in Table 10, the GMEB approach has the lowest execution time on 14 datasets, most of which have an original dimensionality of 100 or less. These evaluations therefore also demonstrate the effectiveness of the approach in high dimensionality.
To better demonstrate the results of the experiments, the performance of the approaches versus the percentage of the selected features are depicted in Figs. 2, 3, 4, 5, 6 and 7.
On the dermatology-6 dataset, the proposed cost-sensitive method performs better than the other methods from the beginning, and it still performs well when the number of features increases. It can be concluded that our method has selected relatively good features. On the Ionosphere dataset, the cost-sensitive method also initially performs better than the other methods with a small number of selected features. The IGIS method performs badly at the beginning (i.e. when the number of selected features is low) and then improves as features are added, which suggests that the features it selects first are not good features. In contrast, when 20% of the features are selected using the proposed approach, the performance is already good and stable. With the S-MVML-LA approach, the curve rises gradually and then drops abruptly. The performance curve of the proposed method is above the others from the beginning and stays there, which shows the superiority of the approach.

Conclusion
In this paper, a hybrid method was proposed to reduce data dimensionality. It combines feature selection and feature extraction in a single optimization problem while creating a balance between classes without manipulating the data, and so uses the advantages of feature selection and feature extraction together. For feature extraction, it solves a manifold learning optimization problem, and for feature selection it solves an optimization problem based on minimizing the generalization error bound. In the evaluations, accuracy and F-score results are reported on the test data. The proposed method is compared with other methods on 21 datasets from the UCI machine learning repository, microarray and high-dimensional datasets, as well as imbalanced datasets from the KEEL repository. The evaluations indicate the superiority of the proposed model over the other methods. As future work, evaluating the proposed approach on real-world problems and applications is suggested.