Improved cost-sensitive representation of data for solving the imbalanced big data classification problem
Journal of Big Data volume 9, Article number: 60 (2022)
Abstract
Dimension reduction is a preprocessing step in machine learning for eliminating undesirable features and increasing learning accuracy. Several data representation methods exist for reducing redundant features, each with its own advantages. At the same time, big data with imbalanced classes is one of the most important issues in pattern recognition and machine learning. In this paper, a method is proposed in the form of a cost-sensitive optimization problem that selects and extracts features simultaneously. The feature extraction phase is based on reducing error and maintaining geometric relationships between data by solving a manifold learning optimization problem. In the feature selection phase, the cost-sensitive optimization problem is adopted based on minimizing the upper limit of the generalization error. Finally, the optimization problem formed from these two problems is solved by adding a cost-sensitive term that creates a balance between classes without manipulating the data. To evaluate the results of the feature reduction, a multi-class linear SVM classifier is used on the reduced data. The proposed method is compared with other approaches on 21 datasets from the UCI machine learning repository, microarray and high-dimensional datasets, as well as imbalanced datasets from the KEEL repository. The results indicate the effectiveness of the proposed method compared to similar approaches.
Introduction
In recent decades, as the volume of data in medical science has grown, a type of data known as microarray data has emerged. Microarray data are extracted from tissue and cell samples and are important in diagnosing diseases and identifying types of cancerous masses [1]. The growth in the number of dimensions increases the computational cost of the system and decreases the classification accuracy [2].
The high dimensionality of such datasets makes feature selection one of the most fundamental topics in machine learning. As the number of features increases, the performance of learning algorithms initially improves, but beyond a certain point additional features no longer help and sometimes degrade performance. Moreover, as the number of features grows, more data samples are needed, which increases the time and space complexity of the problem [3]. The features in the data can be divided into three general categories:

Irrelevant features: These are features that carry little information and are unrelated to the data mining goals, so they usually reduce the performance of data mining algorithms.

Redundant features: These are features that depend on other features and add no independent information, such as features that can be computed from other features.

Relevant features: These are features that have a great impact on the data classification accuracy and are the main purpose of feature selection methods.
Dimensionality reduction techniques can be classified into two main groups [4,5,6]: feature selection and feature extraction. In feature selection, the best features are selected based on their contribution to the final performance of the approach. In this type of dimensionality reduction some information can be lost. Feature extraction or mapping approaches, on the other hand, aim to find another representation of the data so that the new representation (i.e. the new features) improves some specific criterion, such as maintaining the information [e.g. Principal Component Analysis (PCA)], discriminating the classes [e.g. Linear Discriminant Analysis (LDA)] or maintaining the local or global structure of the data (e.g. manifold learning methods). In feature extraction, the dimension can be decreased without losing much of the initial feature information [4, 7,8,9].
High dimensionality and imbalance are common problems in microarray data. Imbalance is one of the major challenges in classification, and it becomes more acute when the data set has a large number of features. Traditional classifiers usually favor the majority class during attribute selection, which leads to poorly tuned parameters or to selecting attributes that mainly describe the majority class. The solutions to the imbalance problem can be divided into two general categories: data-based and model-based. Data-based methods try to restore the expected balance by reducing the majority class data or by generating data from the minority class distribution. Model-based methods try to build a model that is sensitive to the cost of incorrectly classifying minority class data. These methods are called cost-sensitive methods, or model-based methods for short.
In this manuscript, we look for a space in which data that are similar in nature are inherently close. We do not reduce the number of samples, but instead seek a new space in which the data can be better represented: a space in which the data are better separated and in which the imbalance problem is taken into account. The general purpose of this paper is to provide a model that addresses data imbalance by solving a single optimization problem that selects and extracts features simultaneously, and thereby improves accuracy and precision.
The organization of the article is as follows. Related works are studied in “Related works” section. The proposed method is presented in “The proposed method” section. “Experiments” section evaluates the proposed method and compares its performance with other methods. Finally, “Conclusion” section concludes the paper.
Related works
Feature reduction as a preprocessing step can remove irrelevant data, noise and additional features. Feature reduction is based on two main methods of feature selection and feature extraction. In this section, we discuss some important related works with regard to this classification [4].
Feature extraction
Feature extraction methods extract new features from the original dataset and are very useful when we want to reduce the resources required for processing without losing relevant information [4]. Instead of deleting a few features, the input data space is changed. When data is mapped from the input space to a space of smaller dimension by a transformation, the nature of the basic features changes.
Principal Component Analysis (PCA) is a simple feature extraction and embedding method. It is widely used as a baseline feature mapping approach, but it operates in an unsupervised manner. In contrast, our method is a supervised method that uses labels in the feature reduction process. The authors of [13] present a dimension reduction algorithm for information spaces. This algorithm reduces the dimension of the space while maintaining a simplex structure, and it can be used as a black-box method to speed up algorithms that operate in information divergence spaces. It shows how to embed information distances such as the \(\chi^{2}\) and Jensen-Shannon divergences efficiently in low-dimensional spaces while preserving all pairwise distances. Apart from the definition of the feature space, the main difference between this method and the proposed approach is that we consider the misclassification cost in feature extraction, which aims to solve the imbalanced data problem.
In [10], a method is presented that is effective for data with multiple views. In this type of data, several different views are available for each data point at the same time. The method searches for a space called x, with a lower dimension than the input space, that contains information from all views and from which all views can be recovered.
Feature selection
The main purpose of feature selection is to select the appropriate number of attributes to perform the classification tasks [12].
Conventional feature selection methods perform feature selection over the entire sample space. The filter-based local feature selection algorithm [14], built on the artificial immune system, determines a subset of relevant local features for each neighborhood of the sample space. This algorithm introduces a selection procedure to optimize the search over attribute subsets and adopts the idea of local clustering as an evaluation criterion that maximizes between-class distances and minimizes within-class distances.
Ref. [9] tries to select good features by optimizing multivariate criteria based on sparse representation. To measure the complexity of classification under different feature spaces, a feature evaluation criterion called counting region covering (CRC) is first proposed. Then, by simultaneously optimizing the classification error rate and the complexity of the separation boundary, a feature selection framework is provided. The approach proposed in [15] performs feature selection by minimizing a bound on the generalization error. This method performs classification and feature reduction simultaneously: it uses a linear model inspired by the support vector machine that classifies the dimension-reduced data, so feature selection and classification are carried out together.
In [10], multiple kernel learning feature selection (MKLFS) uses kernel methods to search for the complex properties of each feature. However, the available kernels are usually restricted to be positive definite, although certain indefinite kernels can often perform better in real applications. Because of the non-convexity of indefinite kernels, most methods are usually not practical, and the relevant research is relatively limited. A two-step algorithm for optimizing the indefinite kernel support vector machine (IKSVM) and the kernel combination coefficients is also proposed. In [11], an approach is proposed to classify web text documents by exploiting a hierarchical structure to remove words from the attribute vectors that are not related to the WordNet lexical categories.
In [16], a criterion for evaluating the selected features based on feature quality is presented. The main idea is to use a sparse representation to test each feature independently. A feature-based classification method is used to evaluate the proposed method. The authors of [17] propose an effective distance-based feature selection method (ED-Relief), which uses a complex distance measure to optimize within-class and between-class distances simultaneously. In [18], an algorithm for feature selection based on association rules and an integrated classification algorithm based on random sampling are proposed.
The authors of [19] introduced a new sample augmentation method called MAHAKIL. They argued that selecting samples very close to their neighbors results in little variation among the samples generated for the minority class. Therefore, they used the characteristics of two samples as parent samples to produce a new sample. A new feature selection method based on interaction information (II) is proposed in [32] to provide high-level interaction analysis and improve the search in the feature space.
Ref. [20] proposes a new hybrid feature selection method, the IGIS algorithm, which selects features based on interaction information. This algorithm uses the JMI (joint mutual information) criterion to find candidate attributes and adds one attribute at a time to the currently selected subset. After an attribute is added to the selected set, the attribute list is recalculated.
Interacting features are those that appear irrelevant or only weakly relevant to the class when considered individually, but that may be highly correlated with the class when combined with other features. Discovering feature interaction is a challenging task in feature selection. In [21], a novel feature selection algorithm that considers feature interaction is proposed. Mutual information-based feature selection algorithms, although performing well in many cases, currently suffer from two drawbacks: (1) ignoring feature interaction, and (2) overestimating some features. To overcome these shortcomings, [22] proposes a new filter feature selection algorithm based on WJMI-weighted mutual information, which prevents the overestimation of some features by considering feature interaction.
Imbalance learning
One of the main and simplest data reduction methods for imbalanced datasets is presented in [23], which randomly deletes some of the majority class data. The Condensed Nearest Neighbor Rule (CoNN) method [24] uses a similar approach to remove samples that are farther away than most of the data. Unlike [24], the method in [25] eliminates noisy and near-boundary samples. In [26], a clustering method is used to maintain the distribution of minority and majority class data after deleting the data. An approach based on evolutionary algorithms is presented in [27], in which the selection of samples for deletion is treated as a search problem.
One of the most popular data generation methods for the minority class is SMOTE [28]. In this method, samples for the minority class are generated by interpolating between neighboring data points. Several approaches have been proposed to address the weaknesses of SMOTE. In [29], a SMOTE-inspired method called Borderline-SMOTE is used to reduce the tendency of the majority and minority classes to overlap. LN-SMOTE [30] and Safe-Level-SMOTE [31] are further developments of this method [32].
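The interpolation step behind SMOTE-style oversampling can be sketched in a few lines. This is only an illustration of the idea described above, not the reference implementation of [28]: the function name `smote_sample` is hypothetical, and the neighbor is assumed to have already been chosen among the nearest minority-class neighbors of `x`.

```python
import random


def smote_sample(x, neighbor, rng=random):
    """Return one synthetic point on the segment between x and a minority-class neighbor.

    x and neighbor are equal-length lists of floats; rng is any object with a
    random() method returning a float in [0, 1).
    """
    gap = rng.random()  # a single interpolation factor shared by all features
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


# Example: one synthetic minority sample between two existing ones.
synthetic = smote_sample([0.0, 10.0], [1.0, 20.0], rng=random.Random(0))
```

Because the same factor is used for every feature, the synthetic sample always lies on the line segment joining the two parent samples.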
In [33], a method called cost-sensitive high-margin support vector machines is proposed. In this method, the goal is to increase the margin of the minority class and reduce the margin of the majority class. This is done by splitting the cost parameter C of the support vector machine into \(C_{+}\) for positive class data and \(C_{-}\) for negative class data. If the positive class is the minority, \(C_{+}\) is set to a larger value, and vice versa. In [34], a similar cost-sensitive term is included in extreme learning machine (ELM) classification.
In [35], a new risk forecasting method is proposed that treats the task as imbalanced classification and also solves the feature selection problem. In particular, a high-margin loss function is presented in which the weight of the samples is involved. Accordingly, an optimization objective function is designed with a soft tuning of one to improve performance, which is solved in an iterative context.
SMOTE-based class-specific learning is proposed in [36] and uses minority sampling in the kernel space to solve the class imbalance problem. Motivated by weighted kernel-based SMOTE (WK-SMOTE), this method proposes a SMOTE-based class-specific extreme learning machine (SMOTE-CSELM), a variant of the class-specific extreme learning machine (CSELM) that takes advantage of both minority oversampling and class-specific regularization.
Other works
In [37], a low-rank regression model is proposed for feature extraction and feature selection from images without vectorization. To solve the objective function effectively, an optimization-based alternative to Lagrangian coefficients has been developed. In [38], features are extracted and selected in a cascading manner, and the Pigeon Inspired Optimization (PIO) method is used to select the features. In [39], a hybrid approach reduces the majority data with rough set theory and, at the same time, increases the minority data using the SMOTE method. These methods, and methods similar to [39] and [40], are among the data-based hybrid methods. In [41], a feature selection method has been designed that emphasizes two issues: class imbalance and the large size of the data. Also, [39] proposes a cost-sensitive approach in the context of a concave optimization problem and solves it through a Newton-like process. In order to prevent the explosion of data space dimensions while maintaining the statistical coherence of the part of the data set selected for training, [42] has developed an approach for selecting training data based on Pareto analysis performed on classification descriptors. It also provides empirical evidence that this approach retains its validity even when compared to traditional space-reduction methods and classical machine learning algorithms.
Ref. [43] proposes a new framework for identifying anomalous data points in large volumes of high-dimensional data. The authors of [44] proposed methods for reducing the number of variables while retaining more information for reliable classification, as well as several methods for dimension reduction and classification.
The works reviewed generally use only one operation, either feature selection or feature extraction, to reduce the dimension. Hybrid methods are also available, but they mainly combine filter, embedded and wrapper approaches and do not combine the two main categories of selection and extraction at the same time. In this paper, the proposed method is formulated as an optimization problem that solves the feature selection and extraction problems simultaneously, which makes it possible to use the benefits of both approaches. The proposed optimization problem also includes a cost-sensitive function that is designed to make the model robust to data with imbalanced labels, creating a balance without manipulating the data.
The term cost-sensitive refers to making the feature reduction process robust to imbalanced data. This robustness is embedded within the proposed optimization problem. Therefore, the proposed method solves the imbalance problem at the level of the feature space.
The proposed method
Assume that \(X \in {\mathbb{R}}^{d \times n}\) is a data matrix that represents n data points of dimension d, in which \(x_{i}\) is the ith data point. The data labels are represented by the vector \(Y = \left\{ {y_{1} \ldots y_{n} } \right\}\) where \(y_{i} \in \left\{ { - 1, + 1} \right\}\). To clarify the rest of the proposed approach, Table 1 summarizes the frequently used notations. Matrix \(Z \in {\mathbb{R}}^{m \times n}\) is a reduced latent representation of X where \(m \ll d\). In other words, the Z matrix is the result of the feature extraction operation on X, which can be generated by the following mapping:
$$X = QZ + \varepsilon \qquad (1)$$
where \(Q \in {\mathbb{R}}^{d \times m}\) is the mapping matrix and \(\varepsilon \in {\mathbb{R}}^{d \times n}\) is the reconstruction error. In order to minimize the reconstruction error, a simple solution can be the soft minimization of the following equation:
Suppose that feature selection is done on the input space through a diagonal matrix \(\sigma\) such that \(diag\left( \sigma \right)\) consists of 0 and 1, where \(\sigma_{j} = 1\) denotes that the \(j{\text{th}}\) feature is selected from the dataset and \(\sigma_{j} = 0\) denotes that it is not. Assuming feature selection by \(\sigma\), Eq. (2) will be written as follows:
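As a small illustration of the selection step, multiplying a sample by a diagonal 0/1 matrix simply zeroes out the unselected features. The sketch below stores only the diagonal of \(\sigma\); the helper name `select_features` is hypothetical and not part of the original method description.

```python
def select_features(x, sigma_diag):
    """Apply a diagonal selection matrix, given by its diagonal, to one sample.

    sigma_diag[j] == 1 keeps feature j and sigma_diag[j] == 0 removes it.
    """
    return [xj * sj for xj, sj in zip(x, sigma_diag)]


# Example: keep the first and third feature of a three-dimensional sample.
masked = select_features([1.5, 2.0, 3.0], [1, 0, 1])
```

The same function also covers the relaxed case in which the diagonal entries are real-valued weights rather than hard 0/1 indicators.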
Adding the regularization term to the above optimization problem, the following equation is obtained:
where the non-negative parameter \(C\) indicates the penalty for the regularization term. In addition, to control the \(Q\) matrix and prevent large values, the constraint \(Q^{T} Q = I\) is added to the problem. Therefore, the following optimization problem is obtained:
In order to increase the discriminability of the classes and preserve the original geometric structure in the reduced space \(Z\), another term, called local alignment, is added to the objective function. The following problem is obtained by adding this term to the objective function. \(LA\left( Z \right)\) will be discussed later.
where \(LA\left( Z \right)\) is defined as:
Minimizing Eq. (7) means that the distance of the reduced data point \(z_{i}\) from its \(k_{1}\) nearest intra-class neighbors should be decreased and its distance from its \(k_{2}\) nearest neighbors from the opposite class should be increased. In Eq. (7), \(z_{ij}\) denotes the \(j{\text{th}}\) nearest neighbor of the \(i{\text{th}}\) data point with the same label, and \(z^{ip}\) is the \(p{\text{th}}\) nearest neighbor of the \(i{\text{th}}\) data point from the opposite class. The parameter \(\beta\) indicates the importance of the second term in the equation. It must be noted that \(LA\) makes the algorithm supervised, because the class labels must be known to compute intra-class or inter-class neighbors.
The following optimization problem, which is similar to the support vector machine (SVM) optimization [10], performs feature selection and classification simultaneously:
In (8), \(\xi\) is the slack variable and \({\mathbf{1}}\) is a vector of ones with the same length as \(\xi\), so that the product \({\mathbf{1}}^{T} \xi\) is the sum of the \(\xi\) values. Also, \(\omega\) denotes the coefficients of the separating hyperplane, \(b\) is the bias, and \(z_{ij}\) is the \(j{\text{th}}\) feature of the \(i{\text{th}}\) data point. Also,
These are the empirical means of the second-order moment of the \(j{\text{th}}\) feature in the negative and positive classes, respectively. \(n_{ - }\) and \(n_{ + }\) are the numbers of negative and positive class data points, respectively.
One of the constraints in (8) prevents the excessive growth of \(\mathop \sum \nolimits_{j = 1}^{m} \nu_{j}^{ + } \sigma_{j}^{2} \;{\text{and}}\;\mathop \sum \nolimits_{j = 1}^{m} \nu_{j}^{ - } \sigma_{j}^{2}\), which have the upper bounds \(R_{ + }\) and \(R_{ - }\) respectively. By choosing low values for \(R_{ + }\) and \(R_{ - }\), the optimum solution for \(\sigma\) includes more zero values, which results in a higher feature reduction rate.
Another constraint in problem (8), i.e. \(\omega^{T} \omega \le 1\), bounds the \(l_{2}\)-norm of \(\omega\) and thus acts as a regularizer. The final optimization problem resulting from combining the two relations (6) and (8) is as follows:
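To make the role of the two neighbor sets concrete, the following sketch evaluates a local alignment score for a small labeled point set. It is an illustration under stated assumptions rather than the paper's exact Eq. (7): the intra-class sum is assumed to be averaged over \(k_{1}\) and the opposite-class sum over \(k_{2}\), neighbors are searched in the reduced space, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def local_alignment(Z, y, k1, k2, beta):
    """Assumed local alignment score of reduced points Z with labels y.

    For every point, add the mean squared distance to its k1 nearest
    same-class neighbors and subtract beta times the mean squared distance
    to its k2 nearest opposite-class neighbors.
    """
    total = 0.0
    for i, zi in enumerate(Z):
        same = sorted(sq_dist(zi, Z[j]) for j in range(len(Z))
                      if j != i and y[j] == y[i])
        other = sorted(sq_dist(zi, Z[j]) for j in range(len(Z))
                       if y[j] != y[i])
        total += sum(same[:k1]) / k1 - beta * sum(other[:k2]) / k2
    return total


# Two tight, well separated classes on a line give a strongly negative score.
score = local_alignment([[0.0], [1.0], [5.0], [6.0]], [1, 1, -1, -1],
                        k1=1, k2=1, beta=0.5)
```

A lower score corresponds to compact classes that are far from each other, which is the configuration the minimization favors.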
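The per-class second-order moments and the weighted sums that are bounded by \(R_{+}\) and \(R_{-}\) can be computed as in the sketch below. It assumes that each moment is the class-wise average of the squared feature values, as described in the text; the function names `second_moments` and `moment_budget` are hypothetical.

```python
def second_moments(data, labels, target):
    """Mean of the squared value of every feature over the samples of one class.

    data is a list of samples (lists of floats) and labels holds +1 or -1.
    """
    rows = [row for row, label in zip(data, labels) if label == target]
    return [sum(row[j] ** 2 for row in rows) / len(rows)
            for j in range(len(rows[0]))]


def moment_budget(nu, sigma_diag):
    """Weighted sum of nu_j * sigma_j**2, the quantity bounded by R+ or R-."""
    return sum(v * s ** 2 for v, s in zip(nu, sigma_diag))


data = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
labels = [1, 1, -1]
nu_pos = second_moments(data, labels, target=1)    # positive-class moments
nu_neg = second_moments(data, labels, target=-1)   # negative-class moments
used = moment_budget(nu_pos, [1.0, 0.5])           # compare against R+
```

Shrinking a diagonal entry of \(\sigma\) towards zero is the only way to reduce this sum for fixed data, which is why tight bounds push more features out of the model.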
In the above optimization problem, the input data is assumed to be balanced. This means that if there are two classes of data, approximately equal amounts of data are available from each of them. Since this assumption does not always hold in real-world datasets, a cost-sensitive function is added to reinforce the above optimization problem, as in [21]. This results in the following optimization problem, with the same constraints as in Eq. (8):
In problem (11), the slack values of the positive and negative class data are added with different coefficients. Suppose the positive class is the minority; then, to prevent the separating hyperplane from shifting towards that class, the coefficient \(C_{+}\) can be chosen larger than \(C_{-}\). Since the slack of the positive samples is then penalized more heavily, the model will not deviate towards them.
The cost-sensitive optimization problem now becomes:
Solving the proposed cost-sensitive optimization problem, Eq. (12), performs feature extraction and feature weighting simultaneously. The goal of the problem is to find \(z_{i}\), which is the result of feature extraction on \(x_{i} \sigma\). Moreover, since the optimization problem is sensitive to the cost of misclassifying the positive and negative classes, the feature reduction process is driven towards output features that are suitable not only for the majority class but also for the minority class.
The proposed problem increases separation in the reduced space in two ways: through the term LA and through the constraint \(y_{i} \left( {\mathop \sum \limits_{j = 1}^{d} x_{ij} \omega_{j} \sigma_{jj} + b} \right) \ge 1 - \xi_{i} .\)
Feature extraction, as the solution of a manifold learning optimization problem, is based on error reduction and on maintaining geometric relationships between data. For feature selection, an optimization problem based on minimizing the upper bound of the generalization error is adopted. Finally, the optimization problem that combines these two problems is solved by adding a cost-sensitive term that creates a balance in imbalanced data without manipulating the data. The flowchart of the proposed approach is illustrated in Fig. 1.
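The effect of using two coefficients can be seen in a small sketch of the weighted slack term. It assumes the usual hinge form of the slack, \(\xi_{i} = \max (0, 1 - y_{i} f_{i})\), where \(f_{i}\) is the decision value of sample i; the function name and the example values are illustrative only.

```python
def weighted_slack_cost(scores, labels, c_pos, c_neg):
    """Sum of hinge slacks, weighted by c_pos for +1 samples and c_neg for -1.

    scores are decision values f_i and labels are +1 or -1.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        slack = max(0.0, 1.0 - label * score)  # zero once the margin is met
        total += (c_pos if label == 1 else c_neg) * slack
    return total


# One minority (+1) sample inside the margin and two majority (-1) samples.
cost = weighted_slack_cost([0.5, -2.0, 0.0], [1, -1, -1], c_pos=10.0, c_neg=1.0)
```

In this example the single positive sample contributes 5.0 of the total 6.0, even though its slack is only half that of the violating negative sample, so the optimizer has a strong incentive to classify the minority sample correctly.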
The pseudo code
The steps of the optimization algorithm are denoted in Alg. 1.
Algorithm I
An iterative solution to optimize the proposed problem

Input: data matrix \(X = \left\{ {x_{1} \ldots x_{n} } \right\}\) where \(x_{i} \in {\mathbb{R}}^{d}\), \(i = 1 \ldots n\), and labels \(y_{i} \in \left\{ { - 1, + 1} \right\}\)
Output: dimensionality reduced data \(z_{i}\) \(\forall i = 1 \ldots n\), \(\omega\) and \(b\).

1: Initialize \(Q\) and \(\sigma\) with random values.

2: For \(t = 1 \ldots max\_iteration\) do

3: Solve the problem associated with step 1 to obtain \(z_{i}\) at the \(\left( {t + 1} \right){\text{th}}\) iteration by:
$$z_{i}^{(t + 1)} = \left[ {k_{1} k_{2} Q^{(t)T} Q^{(t)} + \left( {k_{1} k_{2} - \beta k_{1} k_{2} + k_{1} k_{2} C} \right)I} \right]^{ - 1} \left[ {k_{1} k_{2} Q^{(t)T} \left( {x_{i} \sigma^{(t)} } \right) + k_{2} \mathop \sum \limits_{j = 1}^{{k_{1} }} \left( {z_{ij} } \right)^{(t)} - k_{1} \beta \mathop \sum \limits_{p = 1}^{{k_{2} }} \left( {z^{ip} } \right)^{(t)} } \right],$$

4: Solve the problem associated with step 2 to obtain \(Q\) by:
$$Q^{(t + 1)} = \left( {\mathop \sum \limits_{i = 1}^{n} \left( {x_{i} \sigma^{(t)} } \right)z_{i}^{(t + 1)T} } \right)\left( {\mathop \sum \limits_{i = 1}^{n} z_{i}^{(t + 1)} z_{i}^{(t + 1)T} } \right)^{ - 1} ,$$

5: Solve the dual problem (12) using \(z_{i}^{{\left( {t + 1} \right)}}\), \(Q^{{\left( {t + 1} \right)}}\)

6: Exit if the convergence criterion is met.

7: end for
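The closed-form updates in steps 3 and 4 can be motivated by a short stationarity argument. The derivation below is a sketch under an assumed per-sample objective: a reconstruction term, a ridge term weighted by \(C\), and a local alignment term whose two sums are averaged over \(k_{1}\) and \(k_{2}\). This form is an assumption chosen because its stationary point coincides with the update used in step 3; \(Q\), \(\sigma\) and the neighbors are held fixed.

```latex
% Assumed per-sample objective (Q, sigma and the neighbors are fixed)
\begin{aligned}
J(z_i) &= \| x_i\sigma - Q z_i \|^2 + C \| z_i \|^2
        + \frac{1}{k_1}\sum_{j=1}^{k_1} \| z_i - z_{ij} \|^2
        - \frac{\beta}{k_2}\sum_{p=1}^{k_2} \| z_i - z^{ip} \|^2 \\[4pt]
% Stationarity: gradient set to zero and divided by 2
0 &= Q^T (Q z_i - x_i\sigma) + C z_i
        + \Big( z_i - \frac{1}{k_1}\sum_{j=1}^{k_1} z_{ij} \Big)
        - \beta \Big( z_i - \frac{1}{k_2}\sum_{p=1}^{k_2} z^{ip} \Big) \\[4pt]
% Collect the terms in z_i
\big[ Q^T Q + (1 - \beta + C) I \big] z_i
  &= Q^T (x_i\sigma) + \frac{1}{k_1}\sum_{j=1}^{k_1} z_{ij}
        - \frac{\beta}{k_2}\sum_{p=1}^{k_2} z^{ip} \\[4pt]
% Multiply both sides by k_1 k_2
\big[ k_1 k_2 Q^T Q + (k_1 k_2 - \beta k_1 k_2 + k_1 k_2 C) I \big] z_i
  &= k_1 k_2 Q^T (x_i\sigma) + k_2 \sum_{j=1}^{k_1} z_{ij}
        - k_1 \beta \sum_{p=1}^{k_2} z^{ip} \\[8pt]
% Q step: least squares on the reconstruction term alone
\frac{\partial}{\partial Q} \sum_{i=1}^{n} \| x_i\sigma - Q z_i \|^2 = 0
  \;&\Longrightarrow\;
  Q = \Big( \sum_{i=1}^{n} (x_i\sigma) z_i^T \Big)
      \Big( \sum_{i=1}^{n} z_i z_i^T \Big)^{-1}
\end{aligned}
```

Under these assumptions, the z-step is a regularized linear solve per sample and the Q-step has the form of the unconstrained least-squares minimizer of the reconstruction term.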
As can be seen from the above algorithm, the loop starting at line 2 is iterated max_iteration times. Inside the loop, \(z_{i}\) is computed first; this takes constant time with respect to the number of samples and features, so calculating the whole Z has a time complexity of O(n). Calculating Q is the most expensive stage of the approach. The first parenthesized factor in the Q update is calculated in O(nm), and the second has the same complexity, so the complexity of finding Q is O(2nm). Solving Eq. (12) involves six summations computed over n. Assuming an approximately constant number of operations in each summation, the computational complexity of the last stage is O(cn).
Repeating these steps max_iteration times gives a total time complexity of O(max_iteration * (n + 2nm + cn)), where n is the number of samples and m is the number of final extracted features. The time complexity of the algorithm is therefore linear in n and m, which is modest compared to similar approaches, and a parallel implementation would further decrease the run time.
Experiments
In this section, we implement the proposed method and evaluate its performance. We first describe the datasets, the evaluation method, the parameter settings, the implementation environment and the evaluation criteria, and then compare the proposed method with other methods in terms of these criteria.
Experimental setup
In order to evaluate the effectiveness of the proposed method, datasets from the UCI machine learning repository, high-dimensional microarray datasets and imbalanced datasets from the KEEL repository have been used.
We try to use large datasets so that an appropriate number of features remains after dimension reduction. Using data with stronger label imbalance also allows a better evaluation of the proposed method on imbalanced data. The specifications of the classification datasets are listed in Table 2.
The number of samples in the datasets ranges from 62 to 10,000, the number of features from 4 to 7129, and the number of classes from 2 to 101. The imbalance ratio in Table 2 is calculated by the following equation [45]:
Also in Table 3, the number of selected features for each data set is specified. As a heuristic, the integer value of half the number of attributes in each data set is used as the number of selected features.
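Both dataset statistics are easy to reproduce. The sketch below assumes the common definition of the imbalance ratio as the size of the largest class divided by the size of the smallest class, which may differ in detail from the formula of [45], and it implements the half-of-the-attributes heuristic with integer division. Both function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class size divided by smallest class size (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def num_selected_features(num_attributes):
    """Integer part of half the number of attributes."""
    return num_attributes // 2


ratio = imbalance_ratio([1] * 90 + [-1] * 10)
selected = num_selected_features(7129)
```

For a dataset with 90 majority and 10 minority samples the ratio is 9.0, and a dataset with 7129 attributes keeps 3564 features under the heuristic.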
Evaluation method
To evaluate the results of the approach, a multi-class linear SVM classifier is used on the reduced data. K-fold cross validation is used to perform the experiments. The LIBSVM library for MATLAB is used for SVM implementation and evaluation, with a linear kernel and C = 1 as the SVM settings. The one-vs-one scheme is used to apply the proposed method to multi-class data.
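The one-vs-one scheme trains one binary classifier for every pair of classes and predicts by majority vote, so K classes require K(K-1)/2 classifiers (5050 for 101 classes). The following sketch shows only the voting logic; `binary_decide` is a hypothetical callback that stands in for a trained pairwise SVM, and ties are broken by the order in which votes are first seen.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, binary_decide, x):
    """Majority vote over all pairwise decisions for sample x.

    binary_decide(a, b, x) must return either a or b.
    """
    votes = Counter(binary_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def nearest_label(a, b, x):
    """Toy pairwise rule: pick the class label closer to the scalar x."""
    return min((a, b), key=lambda c: abs(x - c))


predicted = one_vs_one_predict([0, 1, 2], nearest_label, 1.8)
num_pairs = len(list(combinations(range(101), 2)))
```

In the toy example, class 2 wins two of the three pairwise votes for the input 1.8 and is returned as the prediction.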
Parameter settings
The best values of the parameters are obtained by the Particle Swarm Optimization (PSO) algorithm on the validation data set. In the experiments, β is set to 0.0001, C is set to 0.1, the number of neighbors is set to 3, and the hyperparameters of the problem are set to 20 and 50. The values of all hyperparameters used in the proposed method are presented in Table 4.
Experimental design
MATLAB 2018 software has been used to implement the proposed algorithm. The tests were also evaluated on a PC with Intel Core i3 processor, 4 gigabytes of RAM, and Windows 8.1 operating system.
Evaluation criteria
To compare the performance of the proposed method, the F-score and accuracy criteria are used.
where TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives.
Two other criteria are used in the experiments, which are:
Due to the imbalance in the data set, one of the RCL and PRC criteria may be too high or too low, so their geometric mean is considered in our experiments.
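The criteria above can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definitions of accuracy, precision (PRC), recall (RCL) and the balanced F-score, and it follows the text in taking the geometric mean of RCL and PRC; the function name and the zero-division conventions are illustrative choices.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F-score and geometric mean from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(precision * recall)  # geometric mean of PRC and RCL
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_score": f_score, "g_mean": g_mean}


# Imbalanced example: 12 positive samples against 88 negative samples.
metrics = classification_metrics(tp=9, tn=85, fp=3, fn=3)
```

In this example the accuracy is 0.94 while precision, recall, F-score and geometric mean are all 0.75, which illustrates why accuracy alone can overstate performance on the minority class.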
Feature reduction and classification methods
Results of classification operations after feature reduction were studies using the proposed costsensitive method compared with SMVMLLA [15], GMEB [10], IGIS [20], WJMI [46] And IWFS [21] methods.
As in Table 5, the proposed method has performed better than the other methods in most cases in term of the accuracy. The numbers in parenthesis denote the rank of the approach on each dataset and the final row denote the average rank of each method. As seen in the table, the average rank of the proposed method has the highest value among other methods. After the proposed method, the GMEB method was able to obtain a better ranking, and finally the IGIS method is lower than the others. Not available (NA) values in the following tables, is due to the fact that the corresponding datasets are not evaluated in the reference articles.
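The average-rank row of such a table can be computed as in the sketch below. It assumes that rank 1 is assigned to the highest score on each dataset, that tied methods share the better rank, and that NA entries (represented by `None`) are skipped; the ranking convention is not spelled out in the text, and the method names and scores are placeholders rather than values from the tables.

```python
def average_ranks(scores):
    """Mean rank per method over all datasets on which it was evaluated.

    scores maps dataset name to a dict of method name to score or None.
    """
    totals, counts = {}, {}
    for per_method in scores.values():
        available = {m: v for m, v in per_method.items() if v is not None}
        for method, value in available.items():
            # rank = 1 + number of methods with a strictly higher score
            rank = 1 + sum(1 for other in available.values() if other > value)
            totals[method] = totals.get(method, 0) + rank
            counts[method] = counts.get(method, 0) + 1
    return {method: totals[method] / counts[method] for method in totals}


example = {
    "d1": {"A": 0.90, "B": 0.80, "C": None},
    "d2": {"A": 0.70, "B": 0.95, "C": 0.80},
}
ranks = average_ranks(example)
```

With these placeholder scores, method B obtains an average rank of 1.5, while A and C both average 2.0.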
In Table 6, the proposed costsensitive method has the highest average ranking. The IWFS method average rank is 0.92 worse than the proposed method and is ranked second in total.
Again in Table 7, the proposed method has a better result than other methods. The proposed method has the highest average rank value. In the three datasets Kddcupbufferoverflowvsback, Kddcuplandvssatan, Kddcuprootkitimapvsback, the proposed method had similar performance with GMEB.
As shown in Table 8, the proposed method performs better in most cases. GMEB has superior performance on the Iris, Kddcuprootkitimapvsback and Dermatology6 datasets, while SMVMLLA cannot compete with the other two approaches. On several data sets, however, the proposed method performs strongly, and it has the best average rank.
In Table 9, the methods are evaluated with the G-mean criterion, which is one of the most accurate evaluation criteria for imbalanced data sets. The results show that the cost-sensitive method performed better on most datasets, and the proposed approach again has the best average rank.
Table 10 shows the average execution time of our method compared with the other approaches. As could be expected, the proposed approach has a relatively high execution cost on most datasets. It is nevertheless faster than the IGIS method on the Movement_libras, Colon, and Iris datasets. Moreover, on Caltech101 (with 784 original features), Mnist (with 5469 original features) and DLBCL77 (with 784 original features), which are among the highest-dimensional evaluation datasets, our method has the best convergence time of all methods. These observations indicate that the compared approaches degrade on high-dimensional data, where our algorithm is more efficient. On lower-dimensional data the proposed approach has a higher running time, but the difference is negligible when the whole execution time is on the order of 1 or 2 s. As shown in Table 10, GMEB has the lowest execution time on 14 datasets, most of which have an original dimensionality of 100 or less. These evaluations therefore also demonstrate the effectiveness of the approach on high-dimensional data.
To better illustrate the experimental results, the performance of the approaches versus the percentage of selected features is depicted in Figs. 2, 3, 4, 5, 6 and 7.
On the Dermatology6 dataset, the proposed cost-sensitive method performs better than the other methods from the beginning and continues to perform well as the number of features increases, which suggests that it selects relatively good features. On the Ionosphere dataset, the cost-sensitive method also initially performed better than the other methods, with less specificity. The IGIS method performs badly at the beginning (i.e. when the number of selected features is low) and then improves. That is, it works well only once more features are added, which suggests that the features it selects first are not good ones. With the proposed approach, by contrast, the performance is good and stable once 20% of the features are selected. With the SMVMLLA approach, the curve rises gradually and then drops abruptly. The performance curve of the proposed method lies above the others from the beginning and stays there, which shows the superiority of the approach.
Conclusion
In this paper, a hybrid method was proposed for reducing data dimensionality. It combines feature selection and feature extraction in a single optimization problem while balancing the classes without manipulating the data, and thus uses the advantages of both together. Feature extraction is performed by solving a manifold learning optimization problem, and feature selection is formulated as an optimization problem based on minimizing the generalization error bound. In the evaluations, accuracy and F-score results are reported on the test data. The proposed method is compared with other methods on 21 datasets from the UCI machine learning repository, microarray and high-dimensional datasets, and imbalanced datasets from the KEEL repository. The evaluations indicate the superiority of the proposed model over the other methods. As future work, evaluating the proposed approach on real-world problems and applications is suggested.
Availability of data and materials
The datasets and code will be made available after acceptance.
Notes
Principal component analysis.
Joint mutual information.
References
Rakkeitwinai S, et al. New feature selection for gene expression classification based on degree of class overlap in principal dimensions. Comput Biol Med. 2015;64:292–8.
Kabir MM, Shahjahan M, Murase K. A new local search based hybrid genetic algorithm for feature selection. Neurocomputing. 2011;74(17):2914–28.
Vieira SM, Sousa JM, Runkler TA. Two cooperative ant colonies for feature selection using fuzzy models. Expert Syst Appl. 2010;37(4):2714–23.
Zebari R, et al. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J Appl Sci Technol Trends. 2020;1(2):56–70.
Cheng Z, Lu Z. A novel efficient feature dimensionality reduction method and its application in engineering. Complexity. 2018. https://doi.org/10.1155/2018/2879640.
Zebari DA, et al. A simultaneous approach for compression and encryption techniques using deoxyribonucleic acid. In: 2019 13th international conference on software, knowledge, information management and applications (SKIMA). IEEE; 2019.
Ayesha S, Hanif MK, Talib R. Overview and comparative study of dimensionality reduction techniques for high dimensional data. Inf Fusion. 2020;59:44–58.
Abd-Alsabour N. On the role of dimensionality reduction. J Comput. 2018;13(5):571–9.
Verleysen M, François D. The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer; 2005.
Peleg D, Meir R. A feature selection algorithm based on the global minimization of a generalization error bound. In: Advances in neural information processing systems. 2005.
Elhadad MK, Badran KM, Salama GI. A novel approach for ontology-based dimensionality reduction for web text document classification. Int J Softw Innov. 2017;5(4):44–58.
Luo W. Face recognition based on laplacian eigenmaps. In: 2011 International conference on computer science and service system (CSSS). IEEE; 2011.
Abdullah A, et al. Sketching, embedding and dimensionality reduction in information theoretic spaces. In: Artificial intelligence and statistics. PMLR; 2016.
Wang Y, Li T. Local feature selection based on artificial immune system for classification. Appl Soft Comput. 2020;87:105989.
Zhao Y, et al. Multi-view manifold learning with locality alignment. Pattern Recogn. 2018;78:154–66.
Xu J, et al. Feature selection based on sparse imputation. In: The 2012 international joint conference on neural networks (IJCNN). IEEE; 2012.
Shahee SA, Ananthakumar U. An effective distance based feature selection approach for imbalanced data. Appl Intell. 2020;50(3):717–45.
Chenxi H, et al. Sample imbalance disease classification model based on association rule feature selection. Pattern Recognit Lett. 2020;133:280–6.
Bennin KE, et al. Mahakil: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans Softw Eng. 2017;44(6):534–50.
Nakariyakul S. High-dimensional hybrid feature selection using interaction information-guided search. Knowl Based Syst. 2018;145:59–66.
Zeng Z, et al. A novel feature selection method considering feature interaction. Pattern Recogn. 2015;48(8):2656–66.
Qi X, et al. WJMI: a new feature selection algorithm based on weighted joint mutual information. In: 2015 3rd international conference on mechatronics and industrial informatics (ICMII 2015). Atlantis Press; 2015.
Japkowicz N. The class imbalance problem: significance and strategies. In: Proc. of the Int’l Conf. on artificial intelligence. 2000. Citeseer.
Hart P. The condensed nearest neighbor rule (corresp.). IEEE Trans Inf Theory. 1968;14(3):515–6.
Tomek I. Two modifications of CNN. IEEE Trans Syst Man Cybern. 1976;6:769–72.
Yen SJ, Lee YS. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl. 2009;36(3):5718–27.
García S, Herrera F. Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput. 2009;17(3):275–306.
Chawla NV, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
Han H, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer; 2005.
Maciejewski T, Stefanowski J. Local neighbourhood extension of SMOTE for mining imbalanced data. In: 2011 IEEE symposium on computational intelligence and data mining (CIDM). IEEE; 2011.
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C. Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2009.
Ramentol E, et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced datasets using SMOTE and rough sets theory. Knowl Inf Syst. 2012;33(2):245–65.
Cheng F, et al. Large cost-sensitive margin distribution machine for imbalanced data classification. Neurocomputing. 2017;224:45–57.
Xiao W, et al. Class-specific cost regulation extreme learning machine for imbalanced classification. Neurocomputing. 2017;261:70–82.
Du G, et al. Joint imbalanced classification and feature selection for hospital readmissions. Knowl Based Syst. 2020;200:106020.
Raghuwanshi BS, Shukla S. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl Based Syst. 2020;187:104814.
Yuan H, et al. Low-rank matrix regression for image feature extraction and feature selection. Inf Sci. 2020;522:214–26.
Buvana M, Muthumayil K, Jayasankar T. Content-based image retrieval based on hybrid feature extraction and feature selection technique pigeon inspired based optimization. Ann Roman Soc Cell Biol. 2021;25:424–43.
Wang Q. A hybrid sampling SVM approach to imbalanced data classification. In: Abstract and applied analysis. 2014. Hindawi.
Prachuabsupakij W. CLUS: a new hybrid sampling classification for imbalanced data. In: 2015 12th international joint conference on computer science and software engineering (JCSSE). IEEE; 2015.
Maldonado S, López J. Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification. Appl Soft Comput. 2018;67:94–105.
Roccetti M, et al. An alternative approach to dimension reduction for pareto distributed data: a case study. J Big Data. 2021;8(1):1–23.
Thudumu S, et al. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020;7(1):1–30.
Badaoui F, et al. Dimensionality reduction and class prediction algorithm with application to microarray Big Data. J Big Data. 2017;4(1):1–11.
Amin A, et al. Comparing oversampling techniques to handle the class imbalance problem: a customer churn prediction case study. IEEE Access. 2016;4:7940–57.
Qi X, et al. WJMI: a new feature selection algorithm based on weighted joint mutual information. 2015.
Acknowledgements
All the contributors are mentioned in the manuscript.
Funding
The authors declare that no funding was received for preparing the manuscript.
Author information
Contributions
MF performed the main implementations and evaluations. MHM and YF were the thesis supervisor and advisor, respectively, and originated the idea. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This manuscript does not include studies involving human participants.
Consent for publication
This manuscript does not contain any individual person’s data.
Competing interests
The authors declare that they have no conflict of interest regarding the manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fattahi, M., Moattar, M.H. & Forghani, Y. Improved cost-sensitive representation of data for solving the imbalanced big data classification problem. J Big Data 9, 60 (2022). https://doi.org/10.1186/s40537-022-00617-z
DOI: https://doi.org/10.1186/s40537-022-00617-z
Keywords
 Feature selection
 Feature extraction
 Imbalanced data
 Big data classification
 Cost sensitive
 Optimization