A multi-manifold learning based instance weighting and under-sampling for imbalanced data classification problems

Under-sampling is a technique for overcoming the class-imbalance problem; however, selecting the instances to be dropped and measuring their informativeness is an important concern. This paper brings a new point of view to this problem and exploits the structure of the data to decide on the importance of the data points. For this purpose, a multi-manifold learning approach is proposed. Manifolds represent the underlying structures of data and can help extract the latent space of the data distribution. However, there is no evidence that a single manifold can be relied on to extract the local neighborhood structure of a dataset. Therefore, this paper proposes an ensemble of manifold learning approaches and evaluates each manifold with an information loss-based heuristic. Having computed the optimality score of each manifold, the centrality and marginality degrees of the samples are computed on the manifolds and weighted by the corresponding scores. A gradual elimination approach is proposed, which balances the classes while avoiding a drop in the F-measure on the validation set. The proposed method is evaluated on 22 imbalanced datasets from the KEEL and UCI repositories with different classification measures. The experimental results demonstrate that the proposed approach is more effective than other similar approaches and far better than previous ones, especially when the imbalance ratio is very high.


Introduction
Imbalanced learning is one of the main challenges of classification in real-world problems. This challenge occurs when the number of examples from one class (called the majority class) is greater than the number of samples from the other class (called the minority class). The imbalance problem may be inevitable; it happens when minority class samples are difficult to collect while majority class samples are abundant [1,2]. Classification problems such as fraud detection [3], image segmentation [4,5], intrusion detection [6,7], disease detection [8-10], etc. are mostly imbalanced. Dealing with this issue is challenging because traditional classification approaches presuppose that the training samples are equally distributed among the classes.
Consequently, the majority class prevails over the minority class, and the minority examples are ignored. The inherent characteristics of imbalanced problems, such as overlapping and inseparability of classes [11,12], increase the complexity of such data and weaken classification performance [1,2,13].
Big data is the term used to describe datasets that are very large and complicated and contain a tremendous amount of information. In such enormous datasets, the minority class may nevertheless be represented by a sizable number of examples. Because there are many minority class samples even if they make up only a small fraction of the total dataset, handling imbalance becomes more difficult. In big data analytics, dealing with class imbalance is essential because disregarding it may result in biased model training. Processing large datasets and training machine learning models can be time- and resource-intensive. By balancing the classes, the training process can become more efficient and controllable.
Many studies have addressed the imbalanced data problem, and various techniques have been proposed. These techniques are divided into four main categories: algorithm-driven, cost-sensitive, data-driven, and ensemble approaches. Algorithm-based methods try to adapt classifiers to imbalanced problems; they modify the learning stage and accept the imbalance in the data as it is. In cost-sensitive methods, higher penalties are imposed for misclassifying minority samples, and these methods try to minimize the total penalty [14,15].
Unlike the other techniques, data-driven methods do not depend on the classifier and operate completely independently. These methods are usually applied in the preprocessing stage. They use under-sampling, over-sampling, or both, and try to create a relative balance in imbalanced data. Under-sampling techniques seem to be more popular than over-sampling techniques because over-sampling can cause over-fitting. In ensemble approaches, several classifiers are used simultaneously, and learning is done with the help of a voting technique or by combining the classifiers' scores. A fundamental challenge remains with this type of approach: how to combine the classifiers optimally, which also increases the learning time [1,2,13].
Due to considerations such as applicability, generalizability, and classifier independence, an under-sampling approach is proposed in this paper. Two basic problems of under-sampling techniques can be pointed out, which are still a challenge among researchers: how many, and which, samples should be removed from the majority class? This research tries to overcome these two problems by introducing a new under-sampling method. The proposed method is based on the hypothesis that manifolds are structures that can reflect the density and neighborhood properties of data. But since it is not clear which manifold best suits a specific problem and dataset, a multi-manifold learning approach is proposed in this paper, which assesses the optimality of the manifolds with a proposed information loss-based heuristic.
The optimality indexes are used in a weighted combination of centrality and marginality criteria for the samples. The proposed approach assigns weights that determine the degree of importance of the samples from the majority class. A sequence of weights is created according to the relative importance of the samples. Then, the most insignificant samples are gradually removed from the majority class.
Finally, by combining the most important data of the majority class with the samples of the minority class, the training dataset is created. The proposed method is evaluated on 22 imbalanced datasets from the KEEL and UCI repositories. The main contributions of this research are summarized as follows:
• The samples of the majority class are weighted according to the multi-manifold approach, based on a weighted combination of the centrality and marginality of the samples on each manifold.
• The weights are sorted in descending order. Less important samples are gradually removed from the majority class. The remaining samples can largely represent the distribution of the data.
The proposed approach reduces the overlap of the minority and majority classes and increases class separability, which leads to better classification performance. A simplified graphical abstract of the proposed method is shown in Fig. 1.
The rest of the paper is organized as follows: The next section gives a brief overview of under-sampling methods and related works. "Definitions and background" section introduces some definitions and the background required for a better understanding of the proposed approach. "Proposed method" section explains the proposed method in detail. The experimental results and discussions are in "Experimental results and discussion" section (Fig. 1 presents a simplified flowchart of the proposed multi-manifold approach for under-sampling). Finally, in the last section, the conclusions and future research directions are discussed.

Related works
Under-sampling and over-sampling methods are performed in the preprocessing stage. In over-sampling methods, the number of minority samples is increased: new samples are usually created around the minority samples, or existing minority samples are replicated. Increasing the samples can cause an overfitting problem [16].
Researchers have introduced ensemble approaches based on bagging, such as the OverBagging method [17]. Various versions of the SMOTE technique, which is an over-sampling method, have been presented [18]. The SMOTE technique synthesizes new samples with the help of the k nearest neighbors by randomly choosing a neighbor among the minority samples for each sample. Along with its advantages, the SMOTE algorithm has problems such as over-generalization and variation in convergence [19]. Researchers presented the SMOTE-Bagging [17] and SMOTE-Boost [20] methods, which combine SMOTE with bagging and boosting, respectively. These methods have a high level of computational complexity, although they are significant in terms of performance. In addition, these methods cause overfitting and have many parameters to adjust [21,22].
There are numerous under-sampling methods in the literature. Random under-sampling removes majority class samples at random, which may discard useful instances. This method has been combined with ensemble approaches [23]. The UnderBagging method is a combination of a bagging-based ensemble method and random under-sampling [24]. Seiffert et al. presented the RUSBoost method, which combines random under-sampling and boosting [25]. Researchers proposed under-sampling methods called NearMiss, which eliminate majority samples according to their distance from minority samples [26].
Under-sampling methods can be divided into two main groups: methods based on kNN (k nearest neighbors) [27-29] and methods based on k-means [30-32]. Some under-sampling methods eliminate majority samples based on the information obtained from the samples' nearest neighbors. The purpose of these methods is to remove samples that are located in marginal areas or are noisy and redundant. Kubat and Matwin [27] presented an under-sampling method called One-Sided Selection (OSS), which is one of the applications of Tomek links [33]. The samples in Tomek links are considered marginal or noise samples. In the condensed nearest neighbor (CNN) method, if the label of a sample is the same as the label of its nearest neighbor (1NN), the sample is considered redundant [34]. In the OSS method, a large number of majority samples that are borderline, noisy, or redundant are removed. Removing a large number of samples reduces the performance of the classifier.
Laurikkala proposed the Neighborhood Cleaning Rule (NCL) to remove samples from the majority class [28]. This method uses the Edited Nearest Neighbor (ENN) method [35] to eliminate the samples. In ref. [28], samples whose marginal score is more than two are eliminated. Also, samples are removed if one of their three nearest neighbors is from the minority class. In ref. [13], an under-sampling method is proposed that uses the density of the data to progressively remove data points from the majority class. Two factors are proposed to measure the importance degree of each instance. Furthermore, the optimal under-sampling level is determined progressively.
In addition to eliminating majority samples, Kang and his associates [29] also eliminated the noise in the minority class. They separated the minority samples into three groups: noisy, informative, and relatively informative. As a result, the classifier's performance will improve after the noisy minority samples have been eliminated. The exclusion of minority samples from the algorithm could lead to the failure of the classifier, which makes determining the value of the parameter k crucial.
Yang et al. proposed an under-sampling method that uses the natural neighborhood graph (NaNG). With the help of this graph, they classify the training samples into central, marginal, and noise samples and under-sample by removing the noisy and redundant ones. They called their sample reduction method NNGIR. The strengths of their method are that it is non-parametric, increases the reduction rate, and improves prediction accuracy; the disadvantages reported are the dependence on parameters and relatively low accuracy [36]. Hamidzadeh et al. [37] presented the LMIRA under-sampling method. They removed the non-marginal samples and kept the marginal samples. They formulated their method as a constrained binary optimization problem and used the filled function algorithm to solve it.
Pang and his colleagues [38] introduced a new secure under-sampling method called SIR-KMTSVM. In ref. [38], most redundant samples are removed from both the majority and minority classes. One of the advantages of their method is its applicability to large-scale problems. The disadvantages include high computational complexity and the removal of informative samples. Hamidzadeh and colleagues [39] proposed an under-sampling method that solves the instance reduction problem as an unconstrained multi-objective optimization problem. They designed a weighted optimizer and searched for the appropriate samples with the help of the chaotic krill-herd evolutionary algorithm. The advantages of the method are improvements in accuracy, geometric mean, and computation time. Its main weakness is that it can only be applied to normal-sized datasets.
Another common under-sampling technique is clustering, which helps build a coherent training set [40]. Chen and Shiu [30] put the majority samples into k different clusters using the k-means clustering method. Then, by combining each cluster from the majority class with the minority samples, they created new, more balanced data groups. Each data group is trained separately and builds a classification model. Finally, all models are aggregated to predict new samples. The weakness of this algorithm is that it does not specify how the value of the parameter k is determined.
Yen and Lee [32] used clustering to propose their under-sampling method. They identified representatives of the majority class to create new training data. They first divided the entire training data into k clusters and performed clustering based on the ratio of majority samples to minority samples. The weakness of the algorithm is that it does not specify how to set the parameters k and m. Lin and his colleagues [31] presented a new under-sampling method that clusters the majority samples with the k-means method, where the value of k is equal to the number of minority samples.
In addition to the mentioned methods, some researchers [41-43] used k-means clustering before applying under-sampling to determine the type of each majority sample in terms of noise, redundancy, or marginality. Hamidzadeh and his colleagues [44] introduced an under-sampling method based on hyper-rectangle clustering, called IRAHC. They removed the central samples and kept the marginal and near-marginal samples.
Huang and colleagues [45] introduced a neural network algorithm (NN_HIDC) for the classification of highly imbalanced data. They proposed a generalized gradient descent algorithm, which is used in re-sampling and re-weighting methods in neural networks. They extended the locally controllable bound to reduce the insufficient empirical representation of the positive class. The advantage of this algorithm is its applicability to any highly imbalanced data; its weaknesses are that the extended gradient of the positive class can only reach the local border and that gradient computation is required for all samples in each iteration. Koziarski [46] introduced a radial-based under-sampling (RBU) method for imbalanced data classification. RBU uses the concept of mutual class potential for under-sampling. The advantages of the method include reduced time complexity compared to RBO, effectiveness on difficult minority classes that include small disjuncts, outliers, and few samples, and overcoming the limitations of neighborhood-based methods.
Sun et al. [47] introduced a radial-based under-sampling approach with adaptive under-sampling ratio determination, called RBU-AR. This method determines the appropriate under-sampling ratio according to the complexity of the class overlap and does not use a default value of one or trial and error. The advantage of their approach is better performance under high overlap; its weakness is the lack of application to multi-class problems. Mayabadi and colleagues [48] proposed two density-based algorithms to remove overlap and noise and to balance the classes. The first algorithm uses under-sampling, and the second uses under-sampling and over-sampling simultaneously. Their method removes high-density samples from the majority class. The advantages of their algorithms are maintaining the class structure as much as possible and improving performance.
Vuttipittayamongkol and Elyan [49] proposed an under-sampling method that addresses the binary imbalance problem by removing overlapping data. They focused on the detection and elimination of overlapping majority samples. The advantages of their algorithm are preventing information loss and improving the sensitivity criterion. Its weak points are the difficulty of setting the value of k in the k-NN rule and the lack of examination of multi-class problems. Nwe and colleagues [50] introduced an effective under-sampling method using a k-nearest-neighbor-based overlapping-samples filter to classify imbalanced and overlapping data. The advantage of their algorithm is the prevention of information loss; its weak points are the setting of the value of k and the lack of examination of high-dimensional and multi-class problems.
Zhai and colleagues [51] proposed two diversity-based over-sampling methods, BIDC1 and BIDC2, based on generative models. The BIDC1 and BIDC2 methods use an extreme learning machine autoencoder and a generative adversarial network, respectively. The advantages of their methods include a simple but effective idea, improved performance on data with both low and high imbalance ratios, suitability for different practical scenarios, diversity in over-sampling, and prevention of class overlap.
The weaknesses of their method are the lack of scalability to big data and the difference between the original and generated data distributions. Table 1 summarizes the strengths and weaknesses of some of the main approaches in this regard.

Definitions and background
To better explain the proposed approach, this section presents some primary definitions and background.

Definitions
Manifold: A manifold refers to any surface, curve, or complex nonlinear shape. In manifold learning, the system's intrinsic parameters are identified, and the entire dataset is placed on a manifold that expresses the intrinsic relationships between the data in a space of lower dimension.
Multi-manifold learning: In pattern recognition, we often encounter situations where the dataset does not lie on a single manifold. In other words, if the dataset has several classes, the data of each class may lie on a separate manifold.
Traditional degree of centrality: If a data point lies in the center of its class, such that its label is the same as the labels of its K_c nearest neighbors, then it has a degree of centrality. The degree of centrality of a sample is greater when the number of neighbors with the same label exceeds the number of neighbors with the opposite label, i.e., most of its neighbors belong to the same class.
Traditional degree of marginality: If a data point lies on the edge or border of its class and its label differs from those of all, or some, of its K_m nearest neighbors, then a degree of marginality can be assigned to this data point. The degree of marginality of a sample is higher when the number of neighbors from the opposite class exceeds the number of neighbors from the same class.
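The two traditional degrees above can be sketched with a simple k-nearest-neighbor computation. The following is a minimal illustration only (the function name and the brute-force distance search are ours, and the paper computes these degrees on learned manifolds rather than in the original space):

```python
import numpy as np

def neighbor_label_degrees(X, labels, i, k=5):
    """Toy sketch of the traditional centrality/marginality degrees:
    centrality = fraction of the k nearest neighbors sharing x_i's label,
    marginality = fraction of them carrying the opposite label."""
    diffs = X - X[i]                      # brute-force Euclidean distances
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    dists[i] = np.inf                     # exclude the point itself
    nn = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    same = np.sum(labels[nn] == labels[i])
    centrality = same / k
    marginality = 1.0 - centrality
    return centrality, marginality
```

A point deep inside its class gets centrality close to 1, while a borderline point gets a high marginality value.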

Manifold learning
The purpose of manifold learning algorithms is to map a set of high-dimensional data to a set of lower-dimensional data in such a way that the distances between samples in the lower-dimensional subspace are close to the distances between samples in the original space. Assume that x_i is a data point in a high-dimensional space, and the dataset X = {x_1, x_2, ..., x_n} ∈ R^{n×D} represents n data points in a space with D dimensions. Manifold learning methods seek to represent this dataset in a space with dimension d much lower than that of the original representation space, i.e., d ≪ D. Denoting by y_i the representation of a data point in the lower-dimensional space, the corresponding dataset can be expressed as Y = {y_1, y_2, ..., y_n} ∈ R^{n×d}. In this way, manifold learning is a process that calculates Y while maintaining the inherent relationships of the data, such that the manifold of Y in the low-dimensional space is as similar as possible to the manifold of X in the high-dimensional space.
Manifold learning approaches can be categorized from different points of view, but here, because of computational complexity concerns, the employed algorithms are limited to linear unsupervised manifold learning methods. One of the main characteristics of linear manifold learning methods that makes them appropriate for this approach is their out-of-sample mapping property: they can map test data to the low-dimensional space using the mapping matrix obtained from the training data. In the following, the manifold learning methods included in the proposed approach are briefly introduced.

Principal component analysis (PCA)
Principal component analysis is one of the most common global, linear methods of manifold learning and dimensionality reduction. The main idea of PCA is to find the linear subspace in the low-dimensional space that best fits the scatter of the data in the high-dimensional space. Defining the covariance matrix of the data in the high-dimensional space, Cov(X), and exploiting the symmetry and positive semi-definiteness of the covariance matrix, the eigendecomposition

Cov(X) u_i = λ_i u_i,  i = 1, ..., d,  (1)

gives the mapping matrix U_PCA = [u_1, ..., u_d] formed from the eigenvectors with the d largest eigenvalues; it can be proved that λ_i represents the data scatter after the linear mapping by U_PCA. As a result, the data in the lower-dimensional subspace are obtained as

Y = U_PCA^T X.  (2)
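As a minimal numerical sketch of the PCA mapping described above (an illustration of Eqs. (1) and (2), not the paper's implementation; samples are stored in rows here, so the mapping is applied as Y = X U):

```python
import numpy as np

def pca_map(X, d):
    """Minimal PCA sketch: rows of X are samples. Returns the mapping
    matrix U (top-d eigenvectors of the covariance matrix, Eq. 1) and
    the low-dimensional embedding Y = X U (row form of Eq. 2)."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]     # keep the d largest-scatter axes
    U = eigvecs[:, order]                     # D x d mapping matrix
    return U, Xc @ U
```

For data scattered mostly along one axis, the first principal direction recovered by `pca_map` aligns with that axis.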

Neighborhood preserving embedding (NPE)
The NPE algorithm is one of the popular local methods in manifold learning. This algorithm includes three steps. The first step is to determine the neighbors of each data point. The second step is to form the neighborhood graph weight matrix W, and the third step is to calculate the transformation matrix U_NPE from W by solving the following optimization problem:

min_U tr(U^T X M X^T U)  s.t.  U^T X X^T U = I,  (3)

where M = (I_N − W)^T (I_N − W). After finding the optimal U_NPE, any data point x can be linearly mapped to the new subspace via y = U_NPE^T x.
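The first two NPE steps (neighbor search and local reconstruction-weight computation, as in LLE-style methods) can be sketched as follows; the regularization constant is our own assumption, added for numerical stability, and the function name is illustrative:

```python
import numpy as np

def npe_weights(X, k=3):
    """Sketch of NPE steps 1-2: for each point find its k nearest
    neighbors, then solve the local least-squares problem for
    reconstruction weights that sum to one. Rows of W feed the matrix
    M = (I - W)^T (I - W) used in the projection step (Eq. 3)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nbrs = np.argsort(d)[:k]
        Z = X[nbrs] - X[i]                   # neighbors centered at x_i
        G = Z @ Z.T                          # local k x k Gram matrix
        G += 1e-3 * np.trace(G) * np.eye(k)  # regularize near-singular G
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()             # each row sums to one
    return W
```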

Locality preserving projection (LPP)
LPP is a local manifold learning method that again includes the three main steps of neighbor finding, graph formation, and embedded data extraction. Determining the neighbors and forming the LPP graph are exactly the same as in the other local manifold learning methods; LPP differs from them only in the data extraction step. In fact, LPP is a linear method in which the data mapping matrix from the high-dimensional space to the low-dimensional space is obtained from the generalized eigenvalue problem of Eq. (4):

X L X^T u = λ X O X^T u,  (4)

where U_LPP is the mapping matrix formed from the solutions u, L = O − W, W is the local manifold graph, and O is the diagonal matrix with diagonal elements O_ii = Σ_j w_ij. In this method, the mapping matrix can thus be calculated as an eigenvalue problem. After calculating the mapping matrix, the data representation in the low-dimensional space is Y = U_LPP^T X.
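The projection step of Eq. (4) can be sketched as a generalized symmetric eigenproblem. The function below is an illustration under the assumption that the neighborhood-graph weight matrix W has already been built; samples are stored in rows, so the code transposes to match the column-sample convention of Eq. (4):

```python
import numpy as np
from scipy.linalg import eigh

def lpp_map(X, W, d):
    """Sketch of the LPP projection (Eq. 4): solve the generalized
    eigenproblem X L X^T u = lambda X O X^T u and keep the eigenvectors
    of the d smallest eigenvalues, which best preserve locality."""
    O = np.diag(W.sum(axis=1))     # degree matrix (O_ii = sum_j w_ij)
    L = O - W                      # graph Laplacian
    Xt = X.T                       # D x n: columns are samples, as in Eq. (4)
    A = Xt @ L @ Xt.T              # X L X^T
    B = Xt @ O @ Xt.T              # X O X^T
    vals, vecs = eigh(A, B)        # generalized symmetric-definite problem
    U = vecs[:, :d]                # smallest eigenvalues come first
    return U, X @ U                # row form of Y = U_LPP^T X
```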

Proposed method
The logic of the proposed approach
As discussed, every subspace of data can be expressed by a manifold. The problem is that it is not possible to find out which manifold the data points obey in each subspace, or which manifold the distribution of the data samples is based on. On the other hand, the data structure may be so complex that no single concrete manifold is appropriate. Therefore, an alternative method is used. Instead of having multiple manifolds where each manifold represents a part of the data, multiple manifolds are chosen to represent all the data samples, and optimality weights are assigned to each manifold according to its degree of suitability in maintaining the local neighborhood structure. Consider Fig. 2: the red dotted curve is a series of data points whose structure is unknown. Instead of considering a complex nonlinear function (manifold) for these data, several linear functions (orange curves) are combined, and a combination weight is assigned to each manifold based on its degree of similarity to the structure of the whole data. This linear combination of simpler manifolds is considered a suitable approximation of a more complex function.
In this paper, data importance weighting considers novel measures of the marginality and centrality of data on the data manifolds and then scores the data points based on a weighted combination of these measures. The algorithm gradually removes data samples from the majority class until a definite termination condition is met. To measure and optimize the manifolds, a distance-based information loss heuristic is proposed.
In the proposed method, different manifolds of the data are extracted and used to select the neighbors of each sample that belongs to the majority class. For this purpose, several manifold learning approaches, namely principal component analysis (PCA), neighborhood preserving embedding (NPE), and locality preserving projection (LPP), are applied.
The mapped majority class data, X_N, on the extracted manifolds are denoted by Y_N. The three manifolds are trained in parallel. For each of the mentioned manifolds, M_i, a coefficient α(M_i) is calculated, which indicates the optimality of the manifold. Instance weighting is done based on the two criteria of centrality and marginality on each extracted manifold separately. The final centrality and marginality criteria for a sample x_i are obtained from the α(M_i)-weighted combination of the centralities and marginalities obtained on each manifold M_i. In other words, Centrality(x_i, M_i), which expresses the centrality degree of x_i on manifold M_i, and Marginality(x_i, M_i), which expresses the marginality score of x_i on M_i, are weighted by the parameter α(M_i) to construct the final score. Then the samples are sorted based on their centrality and marginality degrees, and the unnecessary samples are excluded using an iterative strategy. The following sections explain the approach in more detail.

Multi-manifold learning approach
Assume that X = {x_1, x_2, ..., x_n} ∈ R^{n×D} refers to a set of n data points in a space with dimension D, where l(x_i) is the class label of x_i and i ∈ {1, 2, ..., n}. As stated before, the subset X_N of X that corresponds to the larger class is considered the majority samples. In the multi-manifold learning approach, several manifolds are trained on X_N. A coefficient α(M_i) is calculated for each manifold M_i, which aims to indicate the optimality of that manifold for X_N (Fig. 2 illustrates the logic of the proposed multi-manifold learning method based on the weighting of linear manifolds).
In initial experiments, supervised nonlinear manifold learning methods, including Neighborhood Component Analysis (NCA), Maximally Collapsing Metric Learning (MCML), and Large-Margin Nearest Neighbor metric learning (LMNN), were used in the proposed approach, but due to their higher complexity and longer execution time, experiments with these manifolds were abandoned. The execution time of supervised manifold methods, including NCA, MCML, and LMNN, increases greatly when the number of dataset samples approaches or exceeds 1000. Therefore, in this paper, unsupervised manifold learning approaches, namely PCA, NPE, and LPP, are investigated.

Manifolds optimality determination
In the second step, the manifolds of the majority class are assessed to see how well they fit the neighborhood structure of the class. The goal is to give a higher score to the manifolds that best fit the data of the majority class. The idea behind this manifold weighting is simple: after mapping the original data X_N, there will be a distance between the original data and the mapped samples Y_N. Here, an information loss criterion, defined as the distances between the initial data points and their mappings, is used to score the manifolds according to Eqs. (5) and (6).
The set of new data points Y_N in the latent space is obtained by mapping the majority samples X_N onto the manifold M_i using a linear transformation, Y_N = U X_N, where U is the mapping matrix. Then the mapping distance, i.e., the distance between the points of X_N and their corresponding latent representations in Y_N, is calculated for each manifold M_i according to Eq. (5):

dist(M_i) = Σ_{j=1}^{n_c} ||x_j − y_j||,  (5)

where n_c is the number of majority samples. The smaller the distance, the better the manifold. Dividing the sum of distances by the number of samples gives the average distance, and each manifold M_i is weighted by the inverse of this average distance according to Eq. (6):

α(M_i) = n_c / dist(M_i).  (6)

The higher the value of α(M_i), the better this manifold has preserved the neighborhood structure of the data.
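A possible reading of Eqs. (5) and (6) is sketched below. Since a point and its latent representation live in spaces of different dimension, this sketch measures the loss after back-projecting the mapped samples into the original space; the back-projection via U^T is an assumption on our part, as the paper only states that the distance between the original points and their mappings is used:

```python
import numpy as np

def manifold_optimality(XN, U):
    """Sketch of the information-loss score: project the majority
    samples XN (rows) with mapping matrix U, back-project, and weight
    the manifold by the inverse of the average point-to-reconstruction
    distance (Eqs. 5-6)."""
    Y = XN @ U                       # map onto the manifold (row form of Y = U^T X)
    Xhat = Y @ U.T                   # back-projection into the original space
    dist = np.linalg.norm(XN - Xhat, axis=1).sum()   # total loss, Eq. (5)
    n_c = XN.shape[0]
    return n_c / dist                # inverse average distance, Eq. (6)
```

A manifold aligned with the data's dominant direction loses little information and so receives a much larger α than a poorly aligned one.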

Weighted combination of centrality and marginality in the multi-manifold approach
Instance selection is based on the two criteria of centrality and marginality. The combined centrality and marginality criteria for each data sample x_i^N are calculated based on Eqs. (7) and (8). Equation (7) gives the degree of centrality of sample x_i^N as the weighted combination of its centralities over the learned manifolds (i.e., PCA, NPE, and LPP), and Eq. (8) gives its marginality degree as the weighted combination of the marginalities obtained over the same manifolds:

Centrality(x_i^N) = Σ_m α(M_m) · Centrality(x_i^N, M_m),  (7)

Marginality(x_i^N) = Σ_m α(M_m) · Marginality(x_i^N, M_m).  (8)
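Eqs. (7) and (8) amount to an α-weighted sum of the per-manifold scores, which can be sketched as (array layout and function name are ours):

```python
import numpy as np

def combined_scores(per_manifold_centrality, per_manifold_marginality, alphas):
    """Sketch of Eqs. (7)-(8): the final centrality/marginality of each
    majority sample is the alpha-weighted sum of its per-manifold
    scores. Inputs are (n_manifolds x n_samples) arrays plus a vector
    of optimality weights alpha(M_m)."""
    alphas = np.asarray(alphas, dtype=float)
    C = alphas @ np.asarray(per_manifold_centrality)    # Eq. (7)
    M = alphas @ np.asarray(per_manifold_marginality)   # Eq. (8)
    return C, M
```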

Gradual under-sampling of data
In the sample reduction stage, first the marginal samples, which may be outliers or noise, and then the central samples are gradually removed from the majority class with a specific reduction step. The weight of each sample of the majority class is calculated according to Eq. (9), which means that the weight of sample x_i^N is obtained from a linear combination of its centrality and marginality degrees.
After calculating the weights for all samples in the majority class, a sequence of weights is created. The sequence is sorted in descending order, and the majority samples are gradually removed with a specific step. A high value of the sample weight W(x_i^N) means that the sample tends to be an outlier and is a good candidate for removal. By removing a portion of the data (5 or 10 percent in the experiments) as marginal data, the overlap of the majority and minority classes decreases. Moreover, removing marginal samples from the majority class helps to better separate the majority and minority classes. This process continues until the sizes of the minority and majority classes are equal or the F-measure on the validation set starts decreasing. Figure 3 shows the algorithm of the proposed method, and Fig. 4 shows its flowchart along with all the calculation steps.
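The gradual elimination loop can be sketched as follows; the callback interface, the chunked removal, and the stopping logic are our own simplifications of the procedure described above, not the paper's exact algorithm:

```python
import numpy as np

def gradual_undersample(weights, n_majority, n_minority, f_measure, step=0.05):
    """Sketch of gradual under-sampling: `weights` holds W(x_i^N) for
    the majority samples (higher = more expendable); `f_measure(kept)`
    is a caller-supplied callback that retrains on the kept majority
    indices plus the minority class and returns F on a validation set.
    Removal proceeds in chunks of `step` * n_majority samples and stops
    when the classes are balanced or F starts to drop."""
    order = np.argsort(weights)[::-1]       # descending: likely outliers first
    kept = list(order)                      # start with all majority samples
    best_f = f_measure(np.array(kept))
    chunk = max(1, int(step * n_majority))
    while len(kept) > n_minority:
        # drop the next chunk of highest-weight samples, but never
        # shrink below the minority class size
        if len(kept) - chunk >= n_minority:
            candidate = kept[chunk:]
        else:
            candidate = kept[len(kept) - n_minority:]
        f = f_measure(np.array(candidate))
        if f < best_f:                      # F dropped: keep the previous set
            break
        best_f, kept = f, candidate
    return np.array(kept)
```

With a constant F-measure the loop runs until the classes are balanced; as soon as F drops, the previous (larger) majority subset is retained.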

Experimental results and discussion
In this section, extensive experiments are conducted to compare the proposed multi-manifold approach with other methods. First, the proposed multi-manifold approach is compared with the single-manifold approaches of PCA, NPE, and LPP. Also, the proposed method is compared with the RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] under-sampling approaches using support vector machine (SVM), k nearest neighbors (kNN), and classification and regression trees (CART) classifiers with a 10-fold cross-validation scheme and 5 repetitions.
These evaluations are performed on KEEL and UCI datasets based on various performance criteria. The mentioned methods have been chosen for comparison because they are among the most common under-sampling methods in the literature. A non-parametric Wilcoxon signed-rank test is also used for the statistical evaluation of the results. The details are explained in the following sections.

Datasets
In this research, 22 datasets are used in the experiments. These standard datasets are commonly used in the evaluation of the imbalanced data problem and are taken from the KEEL and UCI repositories. Table 2 lists the datasets along with their attributes: the number of features, the number of minority class samples, the number of majority class samples, the total number of samples, and the imbalance ratio. Similar to other studies, multi-class data are transformed into two-class data by the common one-versus-all technique. The class with fewer samples is taken as the minority class and the other as the majority class. As seen in Table 2, the kddcup-buffer_overflow_vs_back and shuttle_2_vs_5 datasets are among the most imbalanced ones, which are important to monitor in the evaluation.
Figure 5 shows the data of ecoli1 and glass0 in three modes: the original data, the output of the single-manifold under-sampling method, and the output of the proposed multi-manifold method. In this figure, x 1 and x 2 denote features that increase the differentiation between the classes. It can be seen that using the proposed method reduces the overlap between the majority and minority classes. To quantify this, the average number of opposite-label neighbors is used as a class-overlap criterion.
Consider K as the number of neighbors of each data point. For each data sample, the K nearest neighbors are found, and the ratio of neighbors belonging to the opposite class is calculated; this value is then averaged over all data points. The smaller the value, the less the overlap between the two classes. Experiments are conducted in three modes (original data, after the single-manifold method, and after the multi-manifold method) on a number of datasets; Table 3 reports the results. In all datasets, the overlap after applying the proposed method, whether single-manifold or multi-manifold, is lower than that of the original data.
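The overlap criterion can be computed directly. A minimal NumPy sketch (the function name is illustrative; Euclidean distance is assumed, with each point excluded from its own neighborhood):

```python
import numpy as np

def class_overlap(X, y, k=5):
    """Average fraction of opposite-label samples among each point's
    k nearest neighbours; smaller values mean less class overlap."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the point itself
    ratios = []
    for i in range(len(X)):
        nn = np.argsort(d2[i])[:k]        # indices of the k nearest neighbours
        ratios.append(np.mean(y[nn] != y[i]))
    return float(np.mean(ratios))
```

Two well-separated clusters give an overlap of 0, while fully interleaved classes drive the value toward 1.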

Experimental setup
The evaluations of the proposed under-sampling approach have been carried out in four scenarios and compared with the results of other articles. The evaluation criteria are precision, recall, F-measure, G-Mean and accuracy. An SVM classifier with an RBF kernel, a 3NN classifier, and a CART with MaxNumSplits = 7 are used as the classifiers so that the results are comparable with those of other articles. Unsupervised manifold learning approaches, namely PCA, NPE and LPP, are used in the experiments. The proposed multi-manifold method can also be implemented with supervised manifold learning approaches, but they are not used due to their high execution time.
In the first two scenarios (i.e., the "Multi-manifold approach with reduction step of 5 percent" and "Multi-manifold approach with reduction step of 10 percent" sections), the effect of the proposed multi-manifold approach for the gradual elimination of the majority samples is investigated with steps of 5% and 10%, respectively, and the performance criteria are reported along with the standard deviation. In the "Comparison of single-manifold and multi-manifold approaches" section, the results of the proposed multi-manifold approach and the best single-manifold results for gradual elimination with a step of 5% are compared based on recall, precision and F-measure, and the results are reported together with the standard deviation. In the "Comparison with other under-sampling approaches" section, the proposed multi-manifold approach is compared with the RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] under-sampling methods. The simulation results show that the proposed method performs better than the other methods on most datasets. MATLAB R2018b and the DRToolbox are used for the evaluations. For simplicity, the neighborhood sizes used to calculate the centrality (K c ) and marginality (K m ) degrees are both set to 5.

Evaluation criteria
In this research, common criteria such as F-measure and G-Mean are used to measure the classification quality. To calculate these criteria, it is necessary to count the numbers of TP, FN, FP and TN samples. The confusion matrix is illustrated in Table 4. In imbalanced problems, examples with positive labels represent the minority class, and examples with negative labels represent the majority class.
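Given the confusion-matrix counts, the criteria follow directly from the standard definitions (the function name is illustrative):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Precision, recall, F-measure and G-Mean from confusion-matrix
    counts; the positive class is the minority class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0           # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0      # true negative rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return precision, recall, f_measure, g_mean
```

The F-measure is the harmonic mean of precision and recall, while the G-Mean balances the accuracy on the two classes, which is why both are preferred over plain accuracy on imbalanced data.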

Multi-manifold approach with reduction step of 5 percent
In this section, the effect of the proposed multi-manifold approach for the gradual elimination of the majority samples with a reduction step of 5% is investigated. Performance criteria are reported in Tables 5, 6 and 7 along with the standard deviation. These evaluations are performed for all three selected classifiers. The tables show two observations. First, using the proposed approach, the recall rate is much higher than the precision rate. This means that the classifiers are more successful at recalling the positive class, which is the minority class, indicating that the approach has lowered the effect of imbalanced classes on the minority class, although at the cost of more false alarms and lower precision. Second, the performances of the SVM and 3NN classifiers are similar, but the CART performance is degraded. Therefore, to avoid excessive evaluations and tables, and because kNN is the most common classifier in this regard, the following evaluations use only the 3NN as the experimental classifier.

Multi-manifold approach with reduction step of 10 percent
In this section, the effect of the proposed multi-manifold approach with a reduction step of 10% is investigated. The other experimental settings are the same as in the previous experiments. The results of Table 8 on the 3NN classifier do not differ much from those of Table 6. Therefore, we can conclude that the approach does not depend strongly on the step size. In the following experiments, a reduction step of 5% is selected to avoid divergence of the proposed reduction method while maintaining a good reduction speed.

Comparison of single-manifold and multi-manifold approaches
In this section, the results of the proposed multi-manifold approach and the single-manifold approaches are compared in Tables 9 and 10. The numbers in parentheses show the rank of each approach on the corresponding dataset. The average rank and average performance of each approach are shown in the last row of the tables. According to Table 9, the multi-manifold approach has the best average rank in terms of recall, and the single-manifold approaches take the second to fourth ranks. The experimental results in Table 9 show a marginal superiority of the proposed multi-manifold approach over each single-manifold approach: the classification recall of the reduction approach using each manifold learning method alone is approximately the same, but the multi-manifold approach provides a slightly better measure for dropping instances.
The results of the evaluation based on the average F-measure with the 3NN classifier are reported in Table 10. According to Table 10, the multi-manifold approach has the best average rank, and the single-manifold approaches obtain lower average performance and average ranks. As seen, the effectiveness and superiority of the proposed multi-manifold approach over the single-manifold learning approaches is obvious. The main strength of the manifold-based approach, either single or multiple, is its impressive F-measure on the highly imbalanced datasets (i.e., kddcup-buffer_overflow_vs_back and shuttle_2_vs_5). This observation holds for both the single-manifold and multi-manifold approaches and is discussed further in the following experiments, which concern comparisons with other state-of-the-art approaches.

Comparison with other under-sampling approaches
In this section, the results of the proposed multi-manifold approach are compared with under-sampling models such as RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] in Tables 11, 12 and 13. Comparisons are based on the recall, precision and F-measure criteria. The simulation results show that the F-measure of the proposed method outperforms the other under-sampling methods by a wide margin.
First, the results of the evaluations of the average recall are reported in Table 11. The numbers in parentheses show the rank of the method on that dataset. On all the datasets, the recall of the proposed approach ranks first, and the other under-sampling methods rank second to eighth. The average rank and average performance of each under-sampling method are shown in the last row of Table 11. The multi-manifold approach has the first average rank compared to the other approaches. It can also clearly be seen that the recall of the proposed approach is considerably higher than that of the other approaches, especially when the IR increases (refer to the rows corresponding to the ecoli1, ecoli2, ecoli3, ecoli4, ecoli034_5, kddcup, page-block, vowel0 and shuttle datasets), where the recall increases by a wide margin of 10 percent. Since the minority class is the positive class, this indicates that the proposed approach is successful in reducing the impact of the majority (negative) class, especially under high imbalance. Table 12 illustrates the results of the evaluations of the average precision criterion. As seen, the multi-manifold approach is relatively weaker on this measure and has the second average rank compared to the other under-sampling models. This lower precision matches the observation made in the initial experiments: the approach tends to focus more on the positive (minority) class and increases the recall rate at the cost of decreasing the precision. The degradation is not favorable, but the main measure to focus on is the F-measure, which is the harmonic mean of these metrics and a compromise between recall and precision. The evaluations based on this measure are denoted in Table 13.
The results of the evaluations of the approaches based on the average F-measure are reported in Table 13. As seen, this time, the proposed multi-manifold approach has the best average rank among all the compared under-sampling methods.

Comparison with state-of-the-art under/over sampling approaches
It should be noted that in Tables 11, 12 and 13, the PUMD method is one of the recent under-sampling methods. In this section, however, the results of the proposed multi-manifold approach are compared with some other state-of-the-art under-sampling methods, namely DB_US [48], NB-Rec [49] and K-US [50], and state-of-the-art over-sampling methods, namely BIDC1 [51] and BIDC2 [51], on the KEEL and UCI data based on the F-measure; the results are reported in Table 14. The other settings are the same as in the previous experiments. The average performance of each under-sampling and over-sampling model is shown in the last row of Table 14. The simulation results show that the F-measure of the proposed method has the best average rank compared to the other mentioned methods. As mentioned in the previous experiments, the proposed method obtains significant results on very imbalanced data.

Statistical analysis by Wilcoxon test
In this research, the non-parametric Wilcoxon signed-rank test is used for the statistical evaluation of the results. The test investigates the significance of the difference in F-measure between the proposed multi-manifold method and the other under-sampling approaches, as reported in Table 15. In this test, the hypotheses H0 and H1 are defined as follows: H0: there is no significant difference between the two methods; H1: there is a significant difference between the two methods. The p-value of the Wilcoxon test on the F-measure is reported for each pair of methods in Table 15.
As is clear in Table 15, all p-values are well below α = 0.05, so H0 is rejected. Therefore, there is a significant difference between the proposed multi-manifold method and the other under-sampling methods: none of the other methods performs better than the proposed method, and the proposed method is significantly superior.
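As a sketch of how such a test can be run with SciPy (the per-dataset F-measure values below are synthetic and purely illustrative, not the paper's numbers):

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset F-measures for the proposed method and one
# competitor; paired by dataset, as in Table 15.
proposed = [0.91, 0.88, 0.95, 0.82, 0.90, 0.87, 0.93, 0.89]
baseline = [0.90, 0.86, 0.92, 0.78, 0.85, 0.81, 0.86, 0.81]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(proposed, baseline)
# Reject H0 (no significant difference) at alpha = 0.05 when p_value < 0.05.
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```

Here every paired difference favors the proposed method, so the statistic (the smaller signed-rank sum) is 0 and the exact two-sided p-value falls below 0.05, rejecting H0.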

Evaluation and discussions on kddcup network intrusion detection dataset
One of the applications that can show the effectiveness of the proposed method, especially on highly imbalanced data, is network intrusion detection. For this purpose, the kddcup datasets are incorporated in the research. The different versions of the kddcup data are shown in Table 2; their highest imbalance ratios are 73, 75 and 100, which are very significant. Table 16 reports the average F-measure of different under-sampling methods and the proposed approach on these data. According to these evaluations, the proposed method performs considerably better than the NN_HIDC [45], NBUS [13], CRIUS [1] and RBUS [48] methods, and its average performance ranks first.
It should be noted that the proposed multi-manifold-based under-sampling method cannot be applied when the number of minority class samples in a dataset is less than or equal to the number of features. This constraint is imposed by the LPP and NPE manifold learning approaches. In this situation, we are forced to use the single-manifold method on that dataset. Therefore, in Table 16, a column titled "manifold model" is added, which shows the type of manifold learning approach (multi-manifold or single-manifold). According to Table 16, the single-manifold method, using PCA, is evaluated on the kddcup-land_vs_portsweep, kddcup-land_vs_satan and kddcup-rootkit-imap_vs_back datasets; the multi-manifold-based under-sampling method is applied to the other kddcup datasets.

Evaluations on artificial datasets
In addition to the evaluations performed on the KEEL and UCI datasets, some experiments are performed on imbalanced artificially created datasets. These evaluations show the stability of the proposed method on datasets with different levels of imbalance. Two models of synthetic data are generated. The first model uses a uniform distribution function on a specific interval [47]; Fig. 6 shows 4 synthetic datasets generated using this model. The second model uses the Two Moons synthetic dataset [1].
Figure 7 shows 4 synthetic datasets generated using the second model. Each dataset contains two features, denoted by x 1 and x 2 . In both models of imbalanced synthetic data generation, the imbalance ratio is taken from the set {1, 5, 10, 20}. In the following, the generation process of both models of imbalanced artificial data is described.
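A self-contained sketch of the second (Two Moons) generation model with a controllable imbalance ratio is given below; the function name and parameter defaults are illustrative, and the moon parameterization follows the usual two-interleaving-half-circles construction rather than any specific implementation from the paper:

```python
import numpy as np

def imbalanced_two_moons(n_maj=500, ir=10, noise=0.1, seed=0):
    """Two interleaving half-circles with imbalance ratio `ir`:
    the majority class lies on the upper moon, the minority on the
    lower one, with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    n_min = max(1, n_maj // ir)
    t_maj = rng.uniform(0, np.pi, n_maj)
    t_min = rng.uniform(0, np.pi, n_min)
    X_maj = np.c_[np.cos(t_maj), np.sin(t_maj)]                # upper moon
    X_min = np.c_[1 - np.cos(t_min), 0.5 - np.sin(t_min)]      # lower moon
    X = np.vstack([X_maj, X_min]) + rng.normal(0, noise, (n_maj + n_min, 2))
    y = np.r_[np.zeros(n_maj), np.ones(n_min)]                 # 1 = minority
    return X, y
```

Sweeping `ir` over {1, 5, 10, 20} reproduces the four imbalance levels used in the experiments.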

Discussion on marginality and centrality criteria
To discuss the effect of the proposed marginality and centrality degrees, the average F-measure of the proposed method is compared for three different sample weighting models in Table 23. In the first model (the first column), only the marginality degree is used to weight the samples. In the second model (the second column), only the centrality degree is applied. In the third model (the third column), the linear combination of marginality and centrality is used. The results of the experiments show that on most of the datasets, the linear combination of marginality and centrality is more effective than the other weighting models. Only for the segment0 dataset are the results of the second model the best, and for the vehicle2-1 dataset, the results of the first model are the best.

Computational complexity analysis
The proposed approach includes the mapping stage, the traditional centrality and marginality calculation, the weighted centrality and marginality calculation, and the gradual removal of samples. Assume that n is the number of samples and D is the dimension of the data. Since the manifolds are trained in parallel in the proposed method, the computational complexity of the mapping part is equal to the highest computational complexity among the manifolds used. Therefore, we assume that in the worst case, the computational complexity of the mapping part equals that of PCA, which is O(D 3 ) [52].
On the other hand, the complexity of the traditional centrality and marginality calculation depends on the complexity of the k-nearest-neighbor selection, which is generally O(nDk), where k ≪ n. Therefore, the complexity of calculating the traditional centrality and marginality becomes 3 × O(nDk), in which 3 is the number of mappings. The computational complexity of the weighted centrality and marginality calculation depends on the number of samples, so its order becomes n × 3 × O(nDk). The complexity of the gradual under-sampling part is O(1) because it does not depend on the size of the problem. Finally, the computational complexity of the proposed multi-manifold approach is O(D 3 ) + O(n × 3 × nDk) + O(1) in the worst case. Table 24 shows the average execution time of the proposed method over 5 experimental repetitions of 10-fold CV. The average execution times indicate that the runtime of the proposed method is consistent with the theoretical analysis of the computational complexity and follows a polynomial time order.
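The three terms above can be collected into a single bound, with the constant factor 3 (the number of manifolds) absorbed into the O-notation:

```latex
T(n, D, k) \;=\; \underbrace{O(D^3)}_{\text{mapping (PCA)}}
\;+\; \underbrace{O\!\left(3\, n^2 D k\right)}_{\text{weighted degrees}}
\;+\; \underbrace{O(1)}_{\text{under-sampling}}
\;=\; O\!\left(D^3 + n^2 D k\right)
```

which is polynomial in n and D, in agreement with the measured execution times reported in Table 24.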

Conclusions and future work
Class imbalance is an important issue that this paper tries to handle. The issue can be addressed via under-sampling, and there are many under-sampling strategies in the literature. This paper introduces a multi-manifold learning-based technique to evaluate the importance of the data points. Different manifold learning strategies are used and assessed by a criterion based on information loss. Three linear unsupervised manifold learning methods are used in order to avoid high computational complexity. The traditional centrality and marginality degrees of the samples are computed on the manifolds and, after the optimality score of each manifold has been computed, weighted by the corresponding score. The proposed gradual removal method attempts to balance the classes without causing the F-measure to decrease on the validation dataset. The proposed approach is assessed on 22 imbalanced datasets from the KEEL and UCI repositories with different and considerable imbalance ratios using various classification metrics. The findings show that the proposed method outperforms other comparable approaches, especially on highly imbalanced problems.
A weakness of the proposed method is that if the number of minority class examples in a dataset is less than or equal to the number of features, the multi-manifold-based approach cannot be applied, because the LPP and NPE manifold learning methods need a minimum number of samples. Therefore, it may be desirable to use other mapping approaches that do not have this constraint. The proposed multi-manifold method also performs poorly when the overlap of the classes increases. Moreover, when the number of samples n grows large (for example, n > 5000), the execution time increases dramatically, so more powerful hardware may be required. Supervised nonlinear manifold learning methods, including Neighborhood Components Analysis (NCA), Maximally Collapsing Metric Learning (MCML) and Large-Margin Nearest Neighbor Metric Learning (LMNN), were omitted due to their computational complexity and longer execution times. Some other limitations of the proposed method are:

Table 1 (excerpt)
[21] Application of automatic enhanced twin support vector machine for imbalanced data classification -Better classifier performance -Less training time -High computational complexity -Setting many parameters
[22] Cost-sensitive multi-variate decision tree with hybrid feature measure on unbalanced data -Performance improvement -Reducing the cost of misclassification -Increasing the complexity -Setting many parameters
[23] A new hybrid method for classification of imbalanced data -Improved performance -Fits very unbalanced data -Deletes useful information -Wrong classification -Data distribution change -Increasing complexity
[29] An under-sampling method with noise filtering for imbalanced data classification -Performance improvement -Improvement of AUC, F-measure and G-means -Insensitivity to minority class noise -Failure to build a learning model by removing minority samples -Sensitive to the imbalance coefficient -Lack of efficiency on highly unbalanced data
[30] Clustering-based under-sampling for IRAHC: an under-sampling method based on hyper-rectangular clustering -Increased accuracy -Increased reduction rate -Deletes informative examples -Random selection of samples

Fig. 4 The flowchart of the proposed method with the step-by-step calculations

Fig. 5 Data distribution of ecoli1 and glass0 in three modes: original data, the single-manifold under-sampling method, and the proposed multi-manifold approach. Red dots denote the majority class

Fig. 6 Synthetic datasets generated using the uniform model

Fig. 7 Synthetic datasets generated using the second model

Table 1
Summary of the strengths and weaknesses of the most important reviewed articles

Table 1 (continued)
-How to set the value of k in the k-NN law -Failure to examine multi-class issues
[50] Overlapping samples filter method based on k-nearest neighbor to solve the imbalanced data problem -Preventing information loss -Setting the value of k -Failure to check high dimensions -Failure to examine multi-class issues
[51] Diversity over-sampling by generative models for unbalanced binary data classification -Simple but effective idea -Diversity in prototyping -Improved performance on data with low and high imbalance ratios -Suitable for various practical scenarios -Lack of scalability in big data -Difference between original and generated data distribution

Table 2
Description of the experimental datasets

Table 4
Confusion matrix

Table 5
The performance of the SVM classifier with the multi-manifold approach with a reduction step of 5 percent

Table 6
The performance of the 3NN classifier with the multi-manifold approach with a reduction step of 5 percent

Table 7
The performance of the CART classifier with the multi-manifold approach with a reduction step of 5 percent

Table 8
The average performance measures of the 3NN classifier with the multi-manifold approach with a reduction step of 10 percent

Table 9
The average recall of the 3NN classifier in the multi-manifold and single manifold approaches

Table 10
The average F measure of the 3NN classifier in the multi-manifold and single manifold approaches

Table 11
The average recall of different under-sampling methods

Table 12
The average precision of different under-sampling methods

Table 13
The average F-measure of different under-sampling methods

Table 14
Average F-measure of different state-of-the-art under-sampling and over-sampling methods as compared with the proposed approach

Table 15
The Wilcoxon test on the proposed multi-manifold method compared with other methods

Table 16
Average F-measure of different under-sampling methods on kddcup datasets

Table 21
Average F-measure of 3NN classification on the Two Moons artificial datasets

Table 22
Average F-measure of CART classification on the Two Moons artificial datasets

Table 23
Average F-measure of the proposed method with three different sample weighting models

Table 24
The average execution time of the proposed method in 5 experimental repetitions of 10-fold CV