 Research
 Open access
A multi-manifold learning based instance weighting and undersampling for imbalanced data classification problems
Journal of Big Data, volume 10, Article number: 153 (2023)
Abstract
Undersampling is a technique to overcome the imbalanced class problem; however, selecting the instances to be dropped and measuring their informativeness is an important concern. This paper brings up a new point of view in this regard and exploits the structure of the data to decide on the importance of the data points. For this purpose, a multi-manifold learning approach is proposed. Manifolds represent the underlying structures of data and can help extract the latent space of the data distribution. However, there is no evidence that we can rely on a single manifold to extract the local neighborhood of the dataset. Therefore, this paper proposes an ensemble of manifold learning approaches and evaluates each manifold with an information-loss-based heuristic. Having computed the optimality score of each manifold, the centrality and marginality degrees of the samples are computed on the manifolds and weighted by the corresponding scores. A gradual elimination approach is proposed, which tries to balance the classes while avoiding a drop in the F-measure on the validation dataset. The proposed method is evaluated on 22 imbalanced datasets from the KEEL and UCI repositories with different classification measures. The results of the experiments demonstrate that the proposed approach is more effective than similar approaches and far better than previous ones, especially when the imbalance ratio is very high.
Introduction
Imbalanced learning is one of the main challenges of classification in real-world problems. This challenge occurs when the number of examples from one class (called the majority class) is greater than the number of samples from the other class (called the minority class). The imbalance problem may be inevitable; it happens when it is difficult to collect minority class samples while majority class samples are more abundant [1, 2]. Classification problems such as fraud detection [3], image segmentation [4, 5], intrusion detection [6, 7], disease detection [8,9,10], etc., are mostly imbalanced. Dealing with this issue is challenging because traditional classification approaches have a presupposition: they assume that the training samples are equally distributed among the classes. Therefore, the majority class prevails over the minority class, and the minority examples are ignored. The inherent characteristics of imbalanced problems, such as overlapping and inseparability of classes [11, 12], increase the complexity of such data and weaken the classification performance [1, 2, 13].
Big data refers to datasets that are very large, complex, and contain a tremendous amount of information. In such enormous datasets, the minority class may still be represented by a sizable number of examples. Because there are many minority class samples even when they make up only a small fraction of the total dataset, handling imbalance becomes more difficult. In big data analytics, dealing with class imbalance is essential because disregarding it may result in biased model training. Processing large datasets and training machine learning models can be time- and resource-intensive; by balancing the classes, the training process can become more efficient and controllable.
Many studies have been done on the imbalanced data problem, and various techniques have been proposed. These techniques are divided into four main categories: algorithm-driven, cost-sensitive, data-driven, and ensemble approaches. Algorithm-based methods try to adapt classifiers to imbalanced problems; they modify the learning stage while accepting the imbalance in the data. In cost-sensitive methods, higher penalties are imposed for misclassifying minority samples, and these methods try to minimize the final penalty [14, 15].
Unlike the other techniques, data-driven methods do not depend on the classifier and operate completely independently. They are usually applied in the preprocessing stage and use undersampling, oversampling, or both to create a relative balance in imbalanced data. Undersampling techniques appear to be more popular than oversampling techniques because oversampling can cause overfitting. In ensemble approaches, several classifiers are used simultaneously, and learning is done with the help of a voting technique or by combining the classifiers' scores. A fundamental challenge remains with this type of approach: how to optimally combine the classifiers, which increases the learning time [1, 2, 13].
Due to justifications such as applicability, generalizability, and classifier independence, an undersampling approach is proposed in this paper. Two basic problems of undersampling techniques can be pointed out, which are still a challenge among researchers: how many, and which, samples should be removed from the majority class? This research tries to overcome these two problems by introducing a new undersampling method. The proposed method is based on the hypothesis that manifolds are structures that can reflect the density and neighborhood properties of data. But since we cannot be sure which manifold best suits a specific problem and dataset, a multi-manifold learning approach is proposed in this paper, which assesses the optimality of manifolds based on a proposed information-loss-based heuristic.
The optimality indexes are used in a weighted combination of the centrality and marginality criteria of the samples. The proposed approach assigns weights that determine the degree of importance of the samples from the majority class. A sequence of weights is created according to the relative importance of the samples; then, the most insignificant samples are gradually removed from the majority class. Finally, the training dataset is created by combining the most important data of the majority class with the samples of the minority class. The proposed method is evaluated on 22 imbalanced datasets from the KEEL and UCI repositories. The main contributions of this research are summarized as follows:

The samples of the majority class are weighted according to the multi-manifold approach, based on a weighted combination of the centrality and marginality of the samples on each manifold.

The weights are sorted in descending order. Less important samples are gradually removed from the majority class. The remaining samples can largely represent the distribution of the data.
The proposed approach reduces the overlap of minority and majority classes and increases class separability, which causes better classification performance. A simplified graphical abstract of the proposed method is shown in Fig. 1.
The rest of the paper is organized as follows: the next section gives a brief overview of undersampling methods and related works. The "Definitions and background" section introduces some definitions and the background required for a better understanding of the proposed approach. The "Proposed method" section explains the proposed method in detail. The experimental results and discussions appear in the "Experimental results and discussion" section. Finally, the last section discusses the conclusions and future research directions.
Related works
Undersampling and oversampling methods are performed in the preprocessing stage. In oversampling methods, the number of minority samples is increased: new samples are usually created around the minority samples, or the minority samples are repeated. Increasing the number of samples can cause overfitting [16].
Researchers have introduced ensemble approaches based on bagging, such as the OverBagging method [17]. Various versions of the SMOTE technique, an oversampling method, have been presented [18]. The SMOTE technique synthesizes new samples with the help of the k nearest neighbors by randomly choosing a neighbor among the minority samples for each sample. Along with its advantages, the SMOTE algorithm has problems such as over-generalization and variation in convergence [19]. Researchers presented the SMOTEBagging [17] and SMOTEBoost [20] methods, which combine SMOTE with bagging and boosting, respectively. These methods are significant in terms of performance but have a high level of computational complexity; in addition, they can cause overfitting and have many parameters to adjust [21, 22].
There are numerous undersampling methods in the literature. Random undersampling deals with the random removal of majority class samples, which may remove useful instances. This method has been combined with ensemble approaches [23]. The UnderBagging method is a combination of the bagging-based ensemble method and random undersampling [24]. Seiffert et al. presented the RUSBoost method, which combines random undersampling and boosting [25]. Researchers have also proposed undersampling methods called NearMiss, which eliminate majority samples according to their distance from minority samples [26].
Undersampling methods can be divided into two main groups: methods based on k-nearest neighbors (kNN) [27,28,29] and methods based on k-means [30,31,32]. Some undersampling methods eliminate majority samples based on information obtained from the nearest neighbors of the samples. The purpose of these methods is to remove samples that are located in marginal areas or are noisy or redundant. Kubat and Matwin [27] presented an undersampling method called One-Sided Selection (OSS), which is one of the applications of Tomek links [33]. The samples in Tomek links are considered marginal or noise samples. In the condensed nearest neighbor (CNN) method, if the label of a sample is the same as the label of its nearest neighbor (1-NN), the sample is considered redundant [34]. In the OSS method, a large number of majority samples that are borderline, noisy, or redundant are removed; removing so many samples reduces the performance of the classifier.
Laurikkala proposed the Neighborhood Cleaning Rule (NCL) to remove samples of the majority class [28]. This method uses the Edited Nearest Neighbor (ENN) method [35] to eliminate samples: in ref. [28], samples whose marginal score is more than two are eliminated, and a sample is also removed if one of its three nearest neighbors is from the minority class. In ref. [13], an undersampling method is proposed that uses the density of the data to progressively remove data points from the majority class. Two factors are proposed to measure the degree of importance of each instance, and the optimal undersampling level is determined progressively.
In addition to eliminating the majority samples, Kang and his associates [29] also eliminated the noise in the minority class. They separated the minority samples into three groups: noisy, informative, and relatively informative. As a result, the classifier's performance will improve after the noisy minority samples have been eliminated. The exclusion of minority samples from the algorithm could lead to the failure of the classifier, which makes figuring out the value of the parameter k crucial.
Yang et al. proposed an undersampling method that uses the natural neighborhood graph (NaNG). With the help of this graph, they classify the training samples into central, marginal, and noise samples and undersample by removing the noisy and redundant ones. They called their sample reduction method NNGIR. One of the reported strengths of their method is that it is non-parametric, increasing the reduction rate and improving prediction accuracy; the reported disadvantages are a dependence on parameters and relatively low accuracy [36]. Hamidzadeh et al. [37] presented the LMIRA undersampling method, which removes non-marginal samples and keeps the marginal ones. They formulated their method as a constrained binary optimization problem and used the filled function algorithm to solve it.
Pang and his colleagues [38] introduced a new secure undersampling method called SIRKMTSVM, in which most redundant samples are removed from both the majority and minority classes. One advantage of their method is its applicability to large-scale problems; its disadvantages include high computational complexity and the removal of informative samples. Hamidzadeh and colleagues [39] proposed an undersampling method that solves the instance reduction problem as an unconstrained multi-objective optimization problem. They designed a weighted optimizer and searched for the appropriate samples with the help of a chaotic krill-herd evolutionary algorithm. The advantages of the method are improvements in accuracy, geometric mean, and computation time; its main weakness is that it can only be applied to normal-sized datasets.
Another common undersampling technique is clustering, which helps to build a coherent training set [40]. Chen and Shiu [30] put the majority samples into k different clusters using the k-means clustering method. Then, by combining each cluster of the majority class with the minority samples, they created new, more balanced data groups. Each data group is trained separately and builds a classification model; finally, all models are aggregated to predict new samples. The weakness of this algorithm is that it does not specify how the value of the parameter k is determined.
Yen and Lee [32] used clustering to propose their undersampling method. They identified representatives of the majority class to create new training data: they first divided the entire training data into k clusters and performed clustering based on a ratio of majority samples to minority samples. The weakness of the algorithm is that it does not specify how to set the parameters k and m. Lin and his colleagues [31] presented a new undersampling method that clusters the majority samples with the k-means method, where the value of k equals the number of minority samples.
In addition to the mentioned methods, some researchers [41,42,43] used k-means clustering before applying the undersampling method to determine the type of each majority sample in terms of noise, redundancy, or marginality. Hamidzadeh and his colleagues [44] introduced an undersampling method based on hyper-rectangle clustering, called IRAHC, which removes the central samples and keeps the marginal and near-marginal samples.
Huang and colleagues [45] introduced a neural network algorithm (NN_HIDC) for the classification of highly imbalanced data. They proposed a generalized gradient descent algorithm, which is used in resampling and reweighting methods in neural networks, and extended the locally controllable bound to reduce the insufficient empirical representation of the positive class. The advantage of this algorithm is its applicability to any highly imbalanced data; its weaknesses are that the extended gradient of the positive class can only reach the local border and that gradient measurement is required for all samples in each iteration. Koziarski [46] introduced a radial-based undersampling (RBU) method for imbalanced data classification. RBU uses the concept of mutual class potential in undersampling. The advantages of the method include reduced time complexity compared to RBO, effectiveness on difficult minority classes that include small disjuncts, outliers, and few samples, and overcoming the limitations of neighborhood-based methods.
Sun et al. [47] introduced a radial-based undersampling approach with adaptive undersampling ratio determination, called RBUAR. This method determines the appropriate undersampling ratio according to the complexity of the class overlap rather than using a default value of one or trial and error. The advantage of their approach is better performance under high overlap; its weakness is the lack of application to multi-class problems. Mayabadi and colleagues [48] proposed two density-based algorithms to remove overlap and noise and to balance the classes. The first algorithm uses undersampling, and the second uses undersampling and oversampling simultaneously. Their method removes high-density samples from the majority class. The advantages of their algorithms are maintaining the class structure as much as possible and improving performance.
Vuttipittayamongkol and Elyan [49] proposed an undersampling method that addresses the binary data imbalance problem by removing overlapping data, focusing on the detection and elimination of overlapping majority samples. The advantages of their algorithm are preventing information loss and improving the sensitivity criterion; its weak points are how to set the value of k in the kNN rule and the failure to examine multi-class problems. Nwe and colleagues [50] introduced an effective undersampling method using a k-nearest-neighbor-based overlapping samples filter to classify imbalanced and overlapping data. The advantage of their algorithm is preventing information loss; its weak points are setting the value of k and not examining high-dimensional or multi-class problems.
Zhai and colleagues [51] proposed two diversity-oriented oversampling methods, BIDC1 and BIDC2, based on generative models; they use an extreme learning machine autoencoder and a generative adversarial network, respectively. Among the advantages of their methods are a simple but effective idea, improved performance on data with both low and high imbalance ratios, suitability for different practical scenarios, diversity in oversampling, and prevention of class overlap. The weaknesses of their method are the lack of scalability to big data and the difference between the original and generated data distributions. Table 1 summarizes the strengths and weaknesses of some of the main approaches in this regard.
Definitions and background
For a better explanation of the proposed approach, this section presents some primary definitions and background.
Definitions
Manifold: A manifold refers to any surface, curve, or complex nonlinear shape. In manifold learning, the system's intrinsic parameters are identified, and the entire dataset is placed on a manifold that expresses the intrinsic relationships between the data points in a lower-dimensional space.
Multi-manifold learning: In pattern recognition, we often encounter situations where the dataset does not lie on a single manifold. In other words, if the dataset has several classes, the data of each class may lie on a separate manifold.
Traditional degree of centrality: If a data point lies in the center of its class, such that its label is the same as the labels of its K_{c} nearest neighbors, then it has a degree of centrality. The degree of centrality of a sample is greater when the number of same-label neighbors exceeds the number of opposite-label neighbors, i.e., most of its neighbors belong to the same class.
Traditional degree of marginality: If a data point lies on the edge or border of its class and its label differs from all, or some, of its K_{m} nearest neighbors, then a degree of marginality can be considered for this data point. The degree of marginality of a sample is higher when the number of neighbors from the opposite class exceeds the number of neighbors from the same class.
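As a concrete illustration, the two kNN-based definitions above can be sketched as follows. This is a toy sketch rather than the paper's exact formulation: the function name `centrality_marginality`, the Euclidean metric, and the single shared `k` (standing in for both K_{c} and K_{m}) are our own assumptions.

```python
import numpy as np

def centrality_marginality(X, labels, k=5):
    """For each point: centrality = fraction of its k nearest neighbors
    sharing its label; marginality = fraction with the opposite label."""
    n = len(X)
    cen = np.zeros(n)
    mar = np.zeros(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)   # distances to all points
        nn = np.argsort(d)[1:k + 1]            # skip the point itself
        same = np.sum(labels[nn] == labels[i]) # same-label neighbor count
        cen[i] = same / k
        mar[i] = 1.0 - cen[i]
    return cen, mar
```

A point deep inside its class gets centrality close to one, while a point on a class border gets a higher marginality score.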
Manifold learning
The purpose of manifold learning algorithms is to map a set of high-dimensional data to a lower-dimensional set in such a way that the distances between samples in the lower-dimensional subspace are close to the distances between samples in the original space. Assume that x_{i} is a data point in a high-dimensional space, and the dataset \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{n}\right)\in {R}^{n\times D}\) represents n data points in a space with D dimensions. Manifold learning methods seek to represent this dataset in a space with a much lower dimension d, i.e., d << D. Denoting by y_{i} the representation of x_{i} in the lower-dimensional space, the corresponding dataset can be expressed as Y = {y_{1}, y_{2}, …, y_{n}} ∈ R^{n×d}. In this way, manifold learning is a process that calculates Y while maintaining the inherent connections of the data, such that the manifold resulting from Y in the low-dimensional space is as similar as possible to the manifold resulting from X in the high-dimensional space.
Manifold learning approaches can be categorized from different points of view, but here, since computational complexity is a concern, we have limited the employed algorithms to linear unsupervised manifold learning methods. One of the main characteristics of linear manifold learning methods that makes them appropriate for this approach is their out-of-sample mapping property: they can map test data to the low-dimensional space using the mapping matrix obtained from the training data. In the following, the manifold learning methods included in the proposed approach are briefly introduced.
Principal component analysis (PCA)
Principal component analysis is one of the most common global, linear methods of manifold learning and dimensionality reduction. The main idea of PCA is to find the linear subspace in the low-dimensional space that best fits the scatter of the data in the high-dimensional space. Defining the covariance matrix of the data in the high-dimensional space, Cov(X), and due to the non-negativity and symmetry of the covariance matrix, we have:

\(Cov(X)={\mathrm{U}}_{PCA}\Lambda {{\mathrm{U}}_{PCA}}^{T}\)

in which \({\mathrm{U}}_{PCA}\in {R}^{D\times D}\) is an orthogonal matrix (\({{\mathrm{U}}_{PCA}}^{T}{\mathrm{U}}_{PCA}=I\)) containing the eigenvectors of Cov(X), and \(\Lambda\) is a diagonal matrix containing the eigenvalues. Taking \({\mathrm{U}}_{PCA}\)=[u_{1}, u_{2}, …, u_{d}] as the matrix of eigenvectors corresponding to the eigenvalues \(0\le {\lambda }_{d}\le {\lambda }_{d-1}\le \dots \le {\lambda }_{1}\), it can be shown that λ_{i} represents the data scatter after a linear mapping by \({\mathrm{U}}_{PCA}\). As a result, each data point x is represented in the lower-dimensional subspace as \(y={U}_{PCA}^{T}x\).
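The eigendecomposition above can be sketched in a few lines of NumPy. This is a minimal illustration of standard PCA, not the authors' code; the function name `pca_map` and the choice to center the data first are our own.

```python
import numpy as np

def pca_map(X, d):
    """Minimal PCA: eigendecompose Cov(X) and project each sample
    onto the top-d eigenvectors (rows of X are samples)."""
    Xc = X - X.mean(axis=0)           # center the data
    C = np.cov(Xc, rowvar=False)      # D x D covariance matrix
    vals, vecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    U = vecs[:, ::-1][:, :d]          # top-d eigenvectors as columns
    return Xc @ U, U                  # n x d embedding and mapping matrix
```

The first embedded coordinate captures the largest data scatter, matching the ordering of the eigenvalues above.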
Neighborhood preserving embedding (NPE)
The NPE algorithm is one of the popular local methods in manifold learning. It includes three steps. The first step is to determine the neighbors of each data point. The second step is to form the neighborhood graph matrix W, and the third step is to calculate the transformation matrix U_{NPE} using W, by solving the following optimization problem:

\(\underset{U}{\mathrm{min}}\;\mathrm{tr}\left({U}^{T}{X}^{T}MXU\right)\quad \mathrm{s.t.}\;{U}^{T}{X}^{T}XU=I\)

where \(M={({\mathrm{I}}_{N}-\mathrm{W})}^{T}({\mathrm{I}}_{N}-\mathrm{W})\). After finding the optimal solution U_{NPE} (from the generalized eigenvalue problem associated with this objective), any data point x can be linearly mapped to the new subspace using y = \({U}_{NPE}^{T}x\).
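The three NPE steps can be sketched as below, following the usual formulation. The function name `npe`, the regularization constants, and the Euclidean kNN search are choices of this sketch, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def npe(X, k=5, d=2):
    """NPE sketch: (1) kNN graph, (2) reconstruction weights W,
    (3) generalized eigenproblem X^T M X u = lam X^T X u."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(dist)[1:k + 1]       # k nearest neighbors
        Z = (X[nn] - X[i]).T                 # D x k neighbor offsets
        G = Z.T @ Z + 1e-6 * np.eye(k)       # regularized local Gram matrix
        w = np.linalg.solve(G, np.ones(k))   # best local reconstruction
        W[i, nn] = w / w.sum()               # weights sum to one
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    A = X.T @ M @ X
    B = X.T @ X + 1e-6 * np.eye(X.shape[1])
    vals, vecs = eigh(A, B)                  # generalized eigvals, ascending
    U = vecs[:, :d]                          # smallest-eigenvalue directions
    return X @ U, U
```

The returned `U` serves as the out-of-sample mapping matrix mentioned earlier: a new point x maps to U^T x.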
Locality preserving projection (LPP)
LPP is a local manifold learning method that again includes the three main steps of neighbor finding, graph formation, and embedded-data extraction. Determining the neighbors and forming the LPP manifold graph are exactly the same as in the other local manifold learning methods; LPP differs only in the data-extraction step. In fact, LPP is a linear learning method in which the mapping matrix from the high-dimensional space to the low-dimensional space is obtained from Eq. (4):

\({X}^{T}LXu=\lambda {X}^{T}OXu\)

where U_{LPP} is the mapping matrix (whose columns are the eigenvectors u with the smallest eigenvalues λ), L = O − W, W is the local manifold graph, and O is the diagonal matrix with diagonal elements \({O}_{ii}=\sum_{j}{w}_{ij}\). In this method, the mapping matrix is calculated as a generalized eigenvalue problem. After calculating the mapping matrix, the data representation in the low-dimensional space is Y = \({U}_{LPP}^{T}X\).
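The LPP steps can be sketched similarly to NPE; the heat-kernel weights with parameter `t`, the regularization term, and the function name `lpp` are assumptions of this illustration.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, k=5, d=2, t=1.0):
    """LPP sketch: heat-kernel adjacency W, L = O - W, then solve
    the generalized eigenproblem X^T L X u = lam X^T O X u."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        for j in np.argsort(dist)[1:k + 1]:
            w = np.exp(-dist[j] ** 2 / t)    # heat-kernel edge weight
            W[i, j] = W[j, i] = w            # symmetric neighborhood graph
    O = np.diag(W.sum(axis=1))               # degree matrix
    L = O - W                                # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ O @ X + 1e-6 * np.eye(X.shape[1])
    vals, vecs = eigh(A, B)                  # ascending generalized eigvals
    U = vecs[:, :d]
    return X @ U, U
```

As with NPE, the smallest-eigenvalue eigenvectors give the projection that best preserves local neighborhoods.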
Proposed method
The logic of the proposed approach
As discussed, every subspace of data can be expressed by a manifold. The problem is that it is not possible to find out which manifold the data points obey in each subspace, or which manifold the distribution of the data samples is based on. On the other hand, the data structure may be so complex that no single concrete manifold is appropriate.
Therefore, we use an alternative method. Instead of having multiple manifolds where each manifold represents a part of the data, we choose multiple manifolds that each represent all the data samples, and assign each manifold an optimality weight according to its degree of suitability in maintaining the local neighborhood structure. Consider Fig. 2, in which the red dotted curve represents a series of data points whose structure is unknown. Instead of considering a complex nonlinear function (manifold) for these data, we combine several linear functions (orange curves) and assign a combination weight to each manifold based on its degree of similarity to the structure of the whole data. We consider this linear combination of simpler manifolds a suitable approximation of the more complex function.
In this paper, data importance weighting considers novel measures of the marginality and centrality of data on the data manifolds and then scores the data points based on a weighted combination of these measures. The algorithm gradually removes data samples from the majority class until a definite termination condition is met. To measure and rank the manifolds, a distance-based information-loss heuristic is proposed.
In the proposed method, different manifolds of the data are extracted and used to select the neighbors of the samples that belong to the majority class. For this purpose, several manifold learning approaches, namely principal component analysis (PCA), neighborhood preserving embedding (NPE), and locality preserving projection (LPP), are applied.
Mapped majority class data, \({\mathrm{X}}^{N}\), on the extracted manifolds are denoted by \({Y}^{N}\). The three manifolds are trained in parallel. For each manifold \({M}_{i}\), a coefficient \(\alpha ({M}_{i})\) is calculated, which indicates the optimality of the manifold. Instance weighting is done based on the two criteria of centrality and marginality on each extracted manifold separately. The final centrality and marginality criteria for a sample \({x}_{i}\) are obtained from the \(\alpha ({M}_{i})\)-weighted combination of the centralities and marginalities obtained on each manifold \({M}_{i}\). In other words, \(Centrality({{x}_{i},M}_{i})\), which expresses the centrality degree of \({x}_{i}\) on manifold \({M}_{i}\), and \(Marginality({x}_{i}, {M}_{i})\), which expresses the marginality score of \({x}_{i}\) on \({M}_{i}\), are weighted by the parameter \(\alpha ({M}_{i})\) to construct the final score. Then the samples are sorted based on their centrality and marginality degrees, and the unnecessary samples are excluded using an iterative strategy. The following sections explain the approach in more detail.
Multi-manifold learning approach
Assume that \(X=\left({x}_{1}^{l},{x}_{2}^{l},\dots ,{x}_{n}^{l}\right)\in {R}^{n\times D}\) refers to a set of n data points in a space with dimension D, where \(l\left({x}_{i}\right)\) is the class label of \({x}_{i}\) and \(i= \{\mathrm{1,2},\dots ,n\}\). As stated before, the subset \({\mathrm{X}}^{N}\) of \(X\) that corresponds to the larger class is considered the set of majority samples. In the multi-manifold learning approach, several manifolds are trained on \({\mathrm{X}}^{N}\). A coefficient \(\alpha ({M}_{i})\) is calculated for each manifold \({M}_{i}\), which aims to indicate the optimality of that manifold for \({\mathrm{X}}^{N}\).
In the initial experiments, supervised manifold learning methods including Neighborhood Component Analysis (NCA), Maximally Collapsing Metric Learning (MCML), and Large-Margin Nearest Neighbor (LMNN) metric learning were used in the proposed approach, but due to their higher complexity and longer execution time, the experiments with these manifolds were abandoned. The execution time of the supervised methods, including NCA, MCML, and LMNN, increases greatly when the number of dataset samples approaches or exceeds 1000. Therefore, in this paper, unsupervised manifold learning approaches, namely PCA, NPE, and LPP, are investigated.
Manifolds optimality determination
In the second step, the manifolds of the majority class are assessed to see how well they fit the neighborhood structure of the class. The goal is to give a higher score to the manifolds that best fit the data of the majority class. The idea behind this manifold weighting is simple: after mapping the original data \({\mathrm{X}}^{N}\), there will be a distance between the original data and the mapped samples \({\mathrm{Y}}^{N}\). Here, an information-loss criterion, defined as the distance between the initial data points and their mappings, is used according to Eqs. (5) and (6) to score the manifolds.
The set of new data points \({\mathrm{Y}}^{N}\) in the latent space is obtained by mapping the majority samples \({\mathrm{X}}^{N}\) onto the manifold \({M}_{i}\), using a linear transformation \({\mathrm{Y}}^{N}=U{\mathrm{X}}^{N}\), where U is the mapping matrix. Then the mapping distance, i.e., the distance between the points of \({\mathrm{X}}^{N}\) and their corresponding latent representations \({\mathrm{Y}}^{N}\), is calculated for each manifold \({M}_{i}\) according to Eq. (5). The smaller this distance, the better the manifold. Supposing the number of data samples equals \({n}_{c}\), dividing the sum of distances by the number of samples gives the average distance. Each manifold \({M}_{i}\) is then weighted by the inverse of this average distance according to Eq. (6). The higher the value of \(\alpha \left({M}_{i}\right)\), the better the manifold has preserved the neighborhood structure of the data.
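Since mapping to a lower dimension makes a direct distance between a sample and its latent representation ill-defined, one plausible reading of this mapping-distance heuristic measures the distance between each sample and its reconstruction from the manifold. The sketch below follows that reading; the function name `manifold_score` and the back-projection step are assumptions of this sketch, not the paper's exact Eqs. (5) and (6).

```python
import numpy as np

def manifold_score(X, U):
    """Hedged sketch of the optimality weight alpha(M_i): project the
    majority samples with mapping matrix U, reconstruct them back into
    the original space, and weight the manifold by the inverse of the
    average reconstruction distance (smaller loss -> larger alpha)."""
    Xc = X - X.mean(axis=0)
    Y = Xc @ U                                # map onto the manifold
    X_rec = Y @ U.T                           # back-projection
    dist = np.linalg.norm(Xc - X_rec, axis=1)
    return 1.0 / (dist.mean() + 1e-12)        # inverse average distance
```

A manifold whose projection loses little information (small reconstruction distance) receives a large score and thus dominates the weighted combination that follows.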
Weighted combination of centrality and marginality in the multi-manifold approach
Instance selection is based on the two criteria of centrality and marginality. The combined criteria of centrality and marginality for each data sample \({x}_{i}^{N}\) are calculated based on Eqs. (7) and (8). Equation (7) gives the degree of centrality for sample \({x}_{i}^{N}\), obtained from the weighted combination of the centralities of the data point over the learned manifolds (i.e., PCA, NPE, and LPP). Equation (8) gives the marginality degree for sample \({x}_{i}^{N}\), obtained from the weighted combination of the marginalities over the same manifolds.
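A sketch of how the per-manifold scores might be combined; normalizing the \(\alpha\) weights to sum to one is our assumption, since the exact normalization in Eqs. (7) and (8) is not reproduced here.

```python
import numpy as np

def combined_scores(cen_per_manifold, mar_per_manifold, alphas):
    """Combine per-manifold centrality and marginality arrays for the
    majority samples, weighted by alpha(M_i) normalized to sum to one."""
    a = np.asarray(alphas, dtype=float)
    a = a / a.sum()                           # normalized manifold weights
    cen = np.sum([ai * c for ai, c in zip(a, cen_per_manifold)], axis=0)
    mar = np.sum([ai * m for ai, m in zip(a, mar_per_manifold)], axis=0)
    return cen, mar
```

A manifold with a higher optimality score thus contributes more to each sample's final centrality and marginality degrees.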
Gradual undersampling of data
In the sample reduction stage, first the marginal samples, which may be outliers or noise, and then the central samples are gradually removed from the majority class with a specific reduction step. The weight of each sample from the majority class is calculated according to Eq. (9), which means that the weight of sample \({x}_{i}^{N}\) is obtained from a linear combination of its centrality and marginality degrees.
After calculating the weights for all samples in the majority class, a sequence of weights is created. This sequence is sorted in descending order, and the majority samples are gradually removed with a specific step. A high sample weight, \(W\left({x}_{i}^{N}\right)\), means that the sample tends to be an outlier and is a good candidate for removal. Removing a portion of the data (5 or 10 percent in the experiments) as marginal data decreases the overlap between the majority and minority classes. Moreover, removing marginal samples from the majority class helps to better separate the two classes. This process continues until the sizes of the minority and majority classes are equal or the F-measure on the validation set starts to decrease. Figure 3 shows the algorithm of the proposed method. Figure 4 shows the flowchart of the proposed method along with all the calculation steps.
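The elimination loop described above can be sketched as follows. The `f_measure` callback (standing in for "retrain on the kept majority samples plus the minority class and compute the validation F-measure") and the exact stopping bookkeeping are assumptions of this sketch.

```python
import numpy as np

def gradual_undersample(X_maj, weights, X_min, f_measure, step=0.05):
    """Drop the highest-weight (most marginal) majority samples in
    increments of `step` until the classes are balanced or the
    F-measure callback reports a drop."""
    order = np.argsort(weights)[::-1]        # indices, descending by weight
    kept = list(order)
    best_f = f_measure(X_maj[kept])
    while len(kept) > len(X_min):
        k = max(1, int(step * len(X_maj)))
        candidate = kept[k:]                 # remove top-k weighted samples
        if len(candidate) < len(X_min):      # never shrink below balance
            candidate = kept[len(kept) - len(X_min):]
        f = f_measure(X_maj[candidate])
        if f < best_f:
            break                            # stop when F-measure degrades
        best_f, kept = f, candidate
    return np.array(kept)
```

The indices returned select the retained majority samples, which are then combined with the minority class to form the final training set.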
Experimental results and discussion
In this section, many experiments are conducted to compare the proposed multi-manifold approach with other methods. First, the proposed multi-manifold approach is compared with the single-manifold approaches based on PCA, NPE and LPP. The proposed method is also compared with the RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] undersampling approaches using support vector machine (SVM), k-nearest neighbors (kNN) and classification and regression tree (CART) classifiers under a 10-fold cross-validation scheme with 5 repetitions.
These evaluations are performed on KEEL and UCI datasets based on various performance criteria. The mentioned methods were chosen for comparison because they are among the most common undersampling methods in the literature. In addition, the non-parametric Wilcoxon signed-rank test is used for statistical evaluation of the results. The details are explained in the following sections.
Datasets
In this research, 22 datasets are used in the experiments. These are standard datasets commonly used to evaluate the imbalanced data problem, and they are taken from the KEEL and UCI repositories. The datasets are shown in Table 2 along with their attributes: the number of features, the number of minority class samples, the number of majority class samples, the total number of samples and the imbalance ratio. Similar to other studies, the multi-class data are transformed into two-class data by the common one-versus-all technique; the class with fewer samples is treated as the minority class and the other as the majority class. As seen in Table 2, the kddcup buffer_overflow_vs_back and shuttle_2_vs_5 datasets are among the most imbalanced ones, which makes them important to monitor in the evaluation.
Figure 5 shows the ecoli1 and glass0 data in three modes: the original data, the output of the single-manifold undersampling method, and the output of the proposed (multi-manifold) method. In this figure, x_{1} and x_{2} denote features that increase the differentiation between classes. It can be seen that the proposed method reduces the overlap between the majority and minority classes. To quantify this, the average ratio of opposite-label neighbors is considered as a class-overlap criterion.
Consider K as the number of neighboring points of each data point. For each data sample, the K nearest neighbors are found, and the ratio of neighbors belonging to the opposite class is calculated; this value is then averaged over all data points. A smaller value suggests less overlap between the two classes. Experiments are conducted in three modes, original data, after the single-manifold method and after the multi-manifold method, on a number of datasets; Table 3 shows the results. The results show that, for all datasets, the amount of overlap after applying the proposed method, either single-manifold or multi-manifold, is less than that of the original data.
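This overlap criterion can be sketched as follows (a brute-force nearest-neighbor version for illustration):

```python
import numpy as np

def overlap_score(X, y, k=5):
    """Average fraction of opposite-class samples among the K nearest
    neighbours of each point; smaller means less class overlap."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    ratios = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nn = np.argsort(d)[:k]               # indices of the k nearest
        ratios.append(np.mean(y[nn] != y[i]))
    return float(np.mean(ratios))
```
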
Experimental setup
The evaluations of the proposed undersampling approach are carried out in four scenarios and compared with the results of other articles. The evaluation criteria are precision, recall, F-measure, G-Mean and accuracy. An SVM classifier with an RBF kernel, a 3NN classifier, and a CART with MaxNumSplits = 7 are used as classifiers so that the results are comparable with those of other articles. The unsupervised manifold learning approaches PCA, NPE and LPP are used in the experiments. The proposed multi-manifold method could also be implemented with supervised manifold learning approaches, but due to their high execution time, they are not used.
In the first two scenarios (i.e., the "Multi-manifold approach with reduction step of 5 percent" and "Multi-manifold approach with reduction step of 10 percent" sections), the effect of the proposed multi-manifold approach for gradual elimination of the majority samples is investigated with steps of 5% and 10%, respectively, and the performance criteria are reported along with the standard deviation. In the "Comparison of single-manifold and multi-manifold approaches" section, the results of the proposed multi-manifold approach and the best single-manifold results for gradual elimination with a step of 5% are compared based on the three criteria of recall, precision and F-measure, and the results are reported together with the standard deviation. In the "Comparison with other undersampling approaches" section, the proposed multi-manifold approach is compared with the RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] undersampling methods. The simulation results show that the proposed method outperforms the other methods on most datasets.
MATLAB R2018b and the DRToolbox are used for the evaluations. For simplicity, the parameters used in the simulation, namely the numbers of nearest neighbors for calculating the centrality (K_{c}) and marginality (K_{m}), are both set to 5.
Evaluation criteria
In this research, common criteria such as the F-measure and G-Mean are used to measure classification quality. To calculate these criteria, it is necessary to count the numbers of TP, FN, FP, and TN samples. The confusion matrix is illustrated in Table 4. In imbalanced problems, examples with positive labels represent the minority class, and examples with negative labels represent the majority class.
Precision, recall, F-measure, and accuracy can be calculated by Eqs. (13) to (19).
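Since Eqs. (13) to (19) follow the standard definitions, they can be sketched directly from the confusion-matrix counts of Table 4:

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Standard imbalanced-learning criteria computed from the
    confusion-matrix counts of Table 4 (minority = positive class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # a.k.a. sensitivity
    specificity = tn / (tn + fp)            # recall of the negative class
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "g_mean": g_mean, "accuracy": accuracy}
```
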
Simulation results
Multi-manifold approach with reduction step of 5 percent
In this section, the effect of the proposed multi-manifold approach for gradual elimination of the majority samples with a reduction step of 5% is investigated. Performance criteria are reported in Tables 5, 6 and 7 along with the standard deviation. These evaluations are performed for all three selected classifiers.
The tables show two observations. First, using the proposed approach, the recall rate is much higher than the precision rate. This means that the classifiers are more successful at recalling the positive class, which is the minority class, indicating that the approach has lowered the effect of class imbalance on the minority class, although at the cost of more false alarms and lower precision. Second, the performances of the SVM and 3NN classifiers are similar, while the CART performance is degraded. Therefore, to avoid excessive evaluations and tables, and because kNN is the most common classifier in this regard, the remaining evaluations use only the 3NN as the experimental classifier.
Multi-manifold approach with reduction step of 10 percent
In this section, the effect of the proposed multi-manifold approach with a reduction step of 10% is investigated. The other experimental settings are the same as in the previous experiments. The results in Table 8 for the 3NN classifier do not differ much from those in Table 6, so we can conclude that the approach is not very sensitive to the step size. Therefore, in the following experiments, a reduction step of 5% is selected to avoid divergence of the proposed reduction method while maintaining a good reduction speed.
Comparison of single-manifold and multi-manifold approaches
In this section, the results of the proposed multi-manifold approach and the single-manifold approaches are compared in Tables 9 and 10. The numbers in parentheses show the rank of each approach on the corresponding dataset. The average rank and average performance of each approach are shown in the last row of the tables. According to Table 9, the multi-manifold approach has the best average rank in terms of recall, and the single-manifold approaches take the second to fourth ranks.
The experimental results in Table 9 show a marginal superiority of the proposed multi-manifold approach over each single-manifold approach. The classification recall of the reduction approach using each manifold learning method alone is approximately the same, but with the multi-manifold approach we obtain a slightly better criterion for dropping instances.
The results of the evaluation based on the average F-measure with the 3NN classifier are reported in Table 10. According to Table 10, the multi-manifold approach has the best average rank, while the single-manifold approaches obtain lower average performance and rank. The effectiveness and superiority of the proposed multi-manifold approach over single-manifold learning is thus evident. The main strength of the manifold-based approaches, either single or multiple, is their impressive F-measure on the highly imbalanced datasets (i.e., kddcup buffer_overflow_vs_back and shuttle_2_vs_5). This observation holds for both the single-manifold and multi-manifold approaches and is discussed further in the next experiments, which concern comparisons with other state-of-the-art approaches.
Comparison with other undersampling approaches
In this section, the results of the proposed multi-manifold approach are compared with undersampling models such as RUS, NCL [28], OSS [27], CNN [34], ENN [35], CBU [31], and PUMD [13] in Tables 11, 12 and 13. Comparisons are based on the recall, precision and F-measure criteria. The simulation results show that the F-measure of the proposed method outperforms the other undersampling methods by a wide margin.
First, the results of the evaluations related to the average recall are reported in Table 11. The numbers in parentheses show the rank of the method on that dataset. On all the datasets, the recall of the proposed approach ranks first, and the other undersampling methods rank second to eighth. The average rank and average performance of each undersampling method are shown in the last row of Table 11. The multi-manifold approach has the best average rank compared to the other approaches. It can also clearly be seen that the recall of the proposed approach is considerably higher than that of the other approaches, especially when the IR increases (refer to the rows corresponding to the ecoli1, ecoli2, ecoli3, ecoli4, ecoli034_5, kddcup, pageblock, vowel0 and shuttle datasets), where recall improves by a wide margin of 10 percent. Since the minority class is the positive class, this indicates that the proposed approach is successful in reducing the impact of the majority (negative) class, especially under high imbalance.
Table 12 illustrates the results of the evaluations related to the average precision criterion. As seen, the multi-manifold approach degrades somewhat on this measure and has the second average rank compared to the other undersampling models. This lower precision matches the observation made in the initial experiments: the approach tends to focus on the positive (minority) class and increase the recall rate at the cost of decreasing precision. The degradation is not favorable, but the main measure to focus on is the F-measure, which is the harmonic mean of these metrics and a compromise between recall and precision. The evaluations based on this measure are given in Table 13.
The results of the evaluations of the approaches based on the average F-measure are reported in Table 13. This time, the proposed multi-manifold approach has the best average rank by a wide margin and the best average performance compared to the other undersampling models.
Apart from the intrinsic effectiveness of the proposed approach in reducing the samples of the majority class, a very interesting observation emerges from these experiments. As discussed in the "Datasets" section and Table 2, there are two highly imbalanced datasets in the experiments, namely kddcup buffer_overflow_vs_back and shuttle_2_vs_5. The proposed approach shows a significant performance on these datasets, more than 10 percent better than the next competing approach. This demonstrates the effectiveness, scalability and generalization of the proposed approach on highly imbalanced data.
Comparison with state-of-the-art under/oversampling approaches
It should be noted that in Tables 11, 12 and 13, the PUMD method is one of the recent undersampling methods. In this section, the results of the proposed multi-manifold approach are further compared with other state-of-the-art undersampling methods, namely DB_US [48], NBRec [49] and KUS [50], and state-of-the-art oversampling methods, namely BIDC1 [51] and BIDC2 [51], on the KEEL and UCI data based on the F-measure; the results are reported in Table 14. The other settings are the same as in the previous experiments. The average performance of each undersampling and oversampling model is shown in the last row of Table 14. The simulation results show that the F-measure of the proposed method has the best average rank compared to the other mentioned methods. As in the previous experiments, the proposed method obtains significant results on highly imbalanced data.
Statistical analysis by Wilcoxon test
In this research, the non-parametric Wilcoxon signed-rank test is used for statistical evaluation of the results. The test investigates the significance of the F-measure difference between the proposed multi-manifold method and the other undersampling approaches, as shown in Table 15. In this test, the hypotheses H0 and H1 are defined as follows:
H0: There is no significant difference between the two methods.
H1: There is a significant difference between the two methods.
The p-value of the Wilcoxon test, based on the F-measure evaluation criterion, is reported for each pair of methods in Table 15.
As is clear from Table 15, all p-values are well below α = 0.05, so H0 is rejected. Therefore, there is a significant difference between the proposed multi-manifold method and the other undersampling methods: none of the other methods performs better than the proposed method, and the proposed method is significantly superior.
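As an illustration, such a paired test can be run with SciPy; the per-dataset F-measure values below are made-up placeholders, not the paper's results:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset F-measures of the proposed method and one
# baseline undersampler (illustrative numbers only).
proposed = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93, 0.89, 0.92, 0.94, 0.86]
baseline = [0.85, 0.80, 0.90, 0.86, 0.84, 0.86, 0.87, 0.83, 0.84, 0.85]

# Paired, two-sided Wilcoxon signed-rank test on the F-measure pairs.
stat, p_value = wilcoxon(proposed, baseline)
reject_h0 = p_value < 0.05   # True -> significant difference (reject H0)
```
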
Evaluation and discussions on kddcup network intrusion detection dataset
One of the applications that shows the effectiveness of the proposed method, especially on highly imbalanced data, is network intrusion detection. For this purpose, kddcup datasets are incorporated in the research. Different versions of the kddcup data are shown in Table 2; the highest imbalance ratios among them are 73, 75 and 100, which are very significant.
Table 16 shows the average F-measure of the different undersampling methods and the proposed approach on these data. According to these evaluations, the proposed method performs considerably better than the NN_HIDC [45], NBUS [13], CRIUS [1] and RBUS [48] methods, and its average performance ranks first among the compared methods.
It should be noted that the proposed multi-manifold-based undersampling method cannot be applied when the number of minority class samples in a dataset is less than or equal to the number of features; this constraint is imposed by the LPP and NPE manifold learning approaches. In this situation, we are forced to use the single-manifold method on that dataset. Accordingly, a column titled "manifold model" is added to Table 16, which shows the type of manifold learning approach (multi-manifold or single-manifold). According to Table 16, the single-manifold method (using PCA) is evaluated on the kddcup land_vs_portsweep, kddcup land_vs_satan and kddcup rootkitimap_vs_back datasets, while the multi-manifold-based undersampling method is applied to the other kddcup datasets.
Evaluations on artificial datasets
In addition to the evaluations performed on the KEEL and UCI datasets, some experiments are performed on artificially created imbalanced datasets. These evaluations show the stability of the proposed method on datasets with different levels of imbalance. Two models of synthetic datasets are generated. The first model uses a uniform distribution over specific intervals [47]; Fig. 6 shows 4 synthetic datasets generated with it. The second model uses the Two Moons synthetic dataset [1]; Fig. 7 shows 4 synthetic datasets generated with it. Each dataset contains two features, denoted by x_{1} and x_{2}. In both models, the imbalance ratio is taken from the set {1, 5, 10, 20}. The generation process of both models is described below.
The first model: The minority class includes 100 data samples, shown with blue circles in Fig. 6. The values of the first feature (x_{1}) and the second feature (x_{2}) are drawn randomly from uniform distributions: x_{1} from the interval [50, 100] and x_{2} from the interval [0, 100]. The majority class includes N_{majority} data samples, indicated by red circles in Fig. 6, where N_{majority} takes values from the set {100, 500, 1000, 2000}. The majority class features are created in the same way, except that x_{1} is drawn from the interval [0, 50] while x_{2} is again drawn from [0, 100]. The number of majority class samples controls the imbalance ratio (IR) of the generated dataset:

a. If N_{majority} = 100 (Fig. 6a), IR = 1.
b. If N_{majority} = 500 (Fig. 6b), IR = 5.
c. If N_{majority} = 1000 (Fig. 6c), IR = 10.
d. If N_{majority} = 2000 (Fig. 6d), IR = 20.
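The first generation model above can be sketched as follows; the function name and seed handling are ours:

```python
import numpy as np

def make_uniform_imbalanced(n_majority, n_minority=100, seed=0):
    """First synthetic model: minority x1 ~ U[50,100], majority x1 ~ U[0,50],
    and x2 ~ U[0,100] for both classes; IR = n_majority / n_minority."""
    rng = np.random.default_rng(seed)
    X_min = np.column_stack([rng.uniform(50, 100, n_minority),
                             rng.uniform(0, 100, n_minority)])
    X_maj = np.column_stack([rng.uniform(0, 50, n_majority),
                             rng.uniform(0, 100, n_majority)])
    X = np.vstack([X_min, X_maj])
    y = np.array([1] * n_minority + [0] * n_majority)  # 1 = minority
    return X, y
```
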
The second model: The process of generating the Two Moons dataset is as follows. The majority class contains 700 data samples, marked with red circles in Fig. 7; these samples form the upper moon, created with center (0, 0). The minority class includes N_{minority} data samples, shown with blue circles in Fig. 7, where N_{minority} takes values from the set {35, 70, 140, 700}; these samples form the lower moon, created with center (0, 1). The number of minority class samples controls the imbalance ratio (IR) of the generated Two Moons dataset:

a. If N_{minority} = 700 (Fig. 7a), IR = 1.
b. If N_{minority} = 140 (Fig. 7b), IR = 5.
c. If N_{minority} = 70 (Fig. 7c), IR = 10.
d. If N_{minority} = 35 (Fig. 7d), IR = 20.
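A sketch of this second model follows; the half-circle parametrization and noise level are our assumptions, with only the moon centers and class sizes taken from the text:

```python
import numpy as np

def make_two_moons_imbalanced(n_minority, n_majority=700, noise=0.1, seed=0):
    """Second synthetic model: the upper moon (majority) is centred at
    (0, 0) and the lower moon (minority) at (0, 1), per the paper;
    IR = n_majority / n_minority."""
    rng = np.random.default_rng(seed)
    t_maj = rng.uniform(0, np.pi, n_majority)      # upper half-circle
    X_maj = np.column_stack([np.cos(t_maj), np.sin(t_maj)])
    t_min = rng.uniform(0, np.pi, n_minority)      # flipped half-circle
    X_min = np.column_stack([np.cos(t_min), 1 - np.sin(t_min)])
    X = np.vstack([X_maj, X_min])
    X += rng.normal(0, noise, X.shape)             # jitter both moons
    y = np.array([0] * n_majority + [1] * n_minority)  # 1 = minority
    return X, y
```
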
The results of the evaluations (with the SVM, 3NN and CART classifiers) on the artificially created imbalanced datasets with uniform distribution are shown in Tables 17, 18 and 19. The average F-measure is reported in two situations: the original data and the undersampled data. The results with the 3NN classifier show that as the imbalance ratio increases, the proposed method becomes more effective, and its average F-measure increases compared to the original data.
The results of the proposed approach on the imbalanced Two Moons data are illustrated in Tables 20, 21 and 22. The average F-measure is reported for both the original data and the undersampled data. When IR = 20, the average F-measure increases from 91 to 100%. The results of the experiments with the CART classifier also show that the proposed method outperforms the original data.
Discussion on marginality and centrality criteria
To assess the effect of the proposed marginality and centrality degrees, the average F-measure of the proposed method is compared under three different weighting models in Table 23. In the first model (first column), only the marginality degree is used to weight the samples. In the second model (second column), only the centrality degree is applied. In the third model (third column), the linear combination of marginality and centrality is used. The results show that for most datasets, the linear combination of marginality and centrality is more effective than the other weighting schemes. Only for the segment0 dataset are the results of the second model the best, and for the vehicle21 dataset the first model gives the best results.
Computational complexity analysis
The proposed approach includes the mapping stage, traditional centrality and marginality calculation, weighted centrality and marginality calculation, and gradual removal of samples. Assume that n is the number of samples and D is the dimension of the data. Since the manifolds are trained in parallel in the proposed method, the computational complexity of the mapping part equals the highest computational complexity among the manifolds used; in the worst case, we assume it equals that of PCA, which is O(D^{3}) [52]. The complexity of the traditional centrality and marginality calculation depends on the complexity of k-nearest-neighbor selection, which is generally O(nDk) with k << n. Therefore, the complexity of calculating the traditional centrality and marginality becomes 3 × O(nDk), where 3 is the number of mappings. The computational complexity of the weighted centrality and marginality also depends on the number of samples, so its order becomes n × 3 × O(nDk). The complexity of the gradual undersampling part is O(1), because it does not depend on the size of the problem. Finally, the computational complexity of the proposed multi-manifold approach is O(D^{3}) + O(n × 3 × nDk) + O(1) in the worst case.
In Table 24, the average execution time of the proposed method in 5 experimental repetitions of 10fold CV is shown. The average execution time indicates that the execution time of the proposed method is consistent with the theoretical analysis of computational complexity and follows the polynomial time order.
Conclusions and future work
Class imbalance is an important issue that this paper attempts to handle. The issue can be addressed via undersampling, and there are many undersampling strategies in the literature. This paper introduces a multi-manifold learning-based technique to evaluate the importance of the data points. Different manifold learning strategies are used and assessed by a criterion based on information loss; three linear unsupervised manifold learning methods are used in order to avoid high computational complexity. After computing the optimality score of each manifold, the traditional centrality and marginality degrees of the samples are computed on the manifolds and weighted by the corresponding score. The suggested gradual removal method attempts to balance the classes without causing the F-measure to decrease on the validation dataset. The proposed approach is assessed on 22 imbalanced datasets from the KEEL and UCI repositories, with considerable imbalance ratios, using various classification metrics. The findings show that the proposed method outperforms other comparable approaches, especially on highly imbalanced problems.
A weakness of the proposed method is that if the number of minority class samples in a dataset is less than or equal to the number of features, the multi-manifold-based approach cannot be applied: the LPP and NPE manifold learning methods require a minimum number of samples. It may therefore be desirable to use other mapping approaches that do not have this constraint. The proposed multi-manifold method also performs poorly when the overlap of the classes increases. Moreover, when the number of samples n becomes large (e.g., n > 5000), the execution time increases dramatically, so more powerful hardware may be required. Supervised nonlinear manifold learning methods, including Neighborhood Components Analysis (NCA), Maximally Collapsing Metric Learning (MCML) and Large-Margin Nearest Neighbor metric learning (LMNN), were omitted due to their computational complexity and longer execution time. Some other limitations of the proposed method are:

- Manifold learning methods such as LLE and IsoMap do not produce an out-of-sample mapping matrix.
- Manifold learning methods such as Factor Analysis (FA) produce NaN values in the transformation matrix.
- Manifold learning methods such as Locally Linear Embedding (LLE) and IsoMap reduce the number of samples after mapping.
All these limitations led us to use unsupervised manifold learning methods such as PCA, LPP, and NPE.
However, the approach has some costs that are not negligible. The first weakness is the relative loss of precision. Precision reduction is unavoidable when we target to increase the classification rate of minority classes, but some future research is required to mitigate this reduction and maybe improve the classification measures much more. The second weakness is the relatively higher computational costs due to applying several manifold learning approaches and computing the degrees of centrality and marginality on each manifold. Although the experiment tries to reduce the computational costs and increase the applicability of the approach by applying three linear unsupervised manifold learning approaches, further improvements are necessary in this regard.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
HoyosOsorio J, et al. Relevant information undersampling to support imbalanced data classification. Neurocomputing. 2021;436:136â€“46.
Koziarski M. CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification. in 2021 International Joint Conference on Neural Networks (IJCNN). 2021.
Tran TC, Dang TK. Machine Learning for Prediction of Imbalanced Data: Credit Fraud Detection. in 2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM). 2021.
Yan M, et al. A lightweight weakly supervised learning segmentation algorithm for imbalanced image based on rotation density peaks. KnowlBased Syst. 2022;244: 108513.
Yeung M, et al. Unified Focal loss: Generalising Dice and cross entropybased losses to handle class imbalanced medical image segmentation. Comput Med Imaging Graph. 2022;95: 102026.
Lin YD, et al. Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection. IEEE Access. 2022;10:15247â€“60.
Shahraki A, et al. A comparative study on online machine learning techniques for network traffic streams analysis. Comput Netw. 2022;207: 108836.
Ghorbani M, et al. RAGCN: Graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal. 2022;75: 102272.
Ning Z, et al. BESS: Balanced evolutionary semistacking for disease detection using partially labeled imbalanced data. Inf Sci. 2022;594:233â€“48.
Zhao H, et al. Severity level diagnosis of Parkinsonâ€™s disease by ensemble Knearest neighbor under imbalanced data. Expert Syst Appl. 2022;189: 116113.
Xu Z, et al. A clusterbased oversampling algorithm combining SMOTE and kmeans for imbalanced medical data. Inf Sci. 2021;572:574â€“89.
Liu J. A minority oversampling approach for fault detection with heterogeneous imbalanced data. Expert Syst Appl. 2021;184: 115492.
Xie X, et al. A novel progressively undersampling method based on the density peaks sequence for imbalanced data. KnowlBased Syst. 2021;213: 106689.
Fattahi M, et al. Improved costsensitive representation of data for solving the imbalanced big data classification problem. J Big Data. 2022;9(1):1â€“24.
Fattahi M, et al. Locally alignment based manifold learning for simultaneous feature selection and extraction in classification problems. KnowlBased Syst. 2023;259:110088.
Galar M, et al. A Review on Ensembles for the Class Imbalance Problem: Bagging, Boosting, and HybridBased Approaches. IEEE Trans Syst Man Cybern. 2012;42(4):463â€“84.
Wang S, Yao X. Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE Symposium on Computational Intelligence and Data Mining. 2009.
Chawla NV. Philip Kegelmeyer, SMOTE: Synthetic Minority Oversampling Technique. J Artif Intell Res. 2002;16:21â€“357.
Wang B. Imbalanced data set learning with synthetic samples. in: Proc. IRIS Machine Learning Workshop, 2004. 19.
Chawla NV, Hall LO, Bowyer KW. Smoteboost: improving prediction of the minority class in boosting. in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer. 2003: p. 107â€“119.
JimenezCastaÃ±o C, OrozcoGutierrez A. Enhanced automatic twin support vector machine for imbalanced data classification. Pattern Recogn. 2020;89:107442.
Li F, Zhang X, Du C, Xu Y, Tian YC. Costsensitive and hybridattribute measure multidecision tree over imbalanced data sets. Inf Sci. 2018;422:242â€“56.
Sun Z, et al. A novel ensemble method for classifying imbalanced data. Pattern Recogn. 2015;48(5):1623â€“37.
Barandela R, SÃ¡nchez JS. New applications of ensembles of classifiers. Pattern Anal Appl. 2003;6(3):245â€“56.
Seiffert C, Van Hulse J, Napolitano A. Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern A Syst Hum. 2010;40(1):185â€“97.
Mani I. Knn approach to unbalanced data distributions: a case study involving information extraction. In: Proc. of International Conference on Machine Learning, Workshop Learning from Imbalanced Data Sets, 2003. 126.
Kubat M. Addressing the curse of imbalanced training sets:onesided selection. in: Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA; 1997: p. 179â€“186.
Laurikkala J, Barahona P, Andreassen S (Eds). Improving identification of difficult small classes by balancing class distribution. In: Artificial Intelligence in Medicine, 2001: p. 63â€“66.
Kang Q, Chang X, Li S, Zhou M. A noisefiltered undersampling scheme for imbalanced classification. IEEE Trans Cybern. 2017;47(12):4263â€“74.
Chen C. Clusteringbased binaryclass classification for imbalanced data sets. in: Proceedings of 2011 IEEE International Conference on Information Reuse and Integration, IEEE, Las Vegas, NV, USA, 2011: p. 384â€“389.
Lin WC, Hu YH, Jhang JS. Clusteringbased undersampling in classimbalanced data. Inform Sci. 2017;409â€“410:17â€“26.
Yen SJ. Clusterbased undersampling approaches for imbalanced data distributions. Expert Syst Appl. 2009;36(3):5718â€“27.
Tomek I. Two modifications of CNN. IEEE Trans Syst Man Cybern A Syst Hum. 1976;6(11):769â€“72.
Hart P. The condensed nearest neighbor rule. IEEE Trans Inform Theory. 1968;14(3):515â€“6.
Tomek I. An experiment with the edited nearestneighbor rule. IEEE Trans Syst Man Cybern A Syst Hum. 1976;6(6):448â€“52.
Yang L, et al. Natural neighborhood graphbased instance reduction algorithm without parameters. Appl Soft Comput. 2018;70:279â€“87.
Hamidzadeh J, Monsefi R, Yazdi HS. LMIRA: Large Margin Instance Reduction Algorithm. Neurocomputing. 2014;145:477â€“87.
Pang X, Xu C, Xu Y. Scaling KNN multiclass twin support vector machine via safe instance reduction. KnowlBased Syst. 2018;148:17â€“30.
Hamidzadeh J, Kashefi N, Moradi M. Combined weighted multiobjective optimizer for instance reduction in twoclass imbalanced data problem. Eng Appl Artif Intell. 2020;90: 103500.
Deng X. IEEE 35th International Performance Computing and Communications Conference. IPCCC, IEEE. 2016;2016:1â€“8.
Ofek N, Rokach L, Stern R, Shabtai A. Fast-CBUS: a fast clustering-based undersampling method for addressing the class imbalance problem. Neurocomputing. 2017;243:88–102.
Zhang X. Unbalanced data classification algorithm based on clustering ensemble undersampling. Comput Sci. 2015;42(11):63–6.
Ng WWY, Yeung DS, Yin S, Roli F. Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans Cybern. 2015;45(11):2402–12.
Hamidzadeh J, Monsefi R, Sadoghi Yazdi H. IRAHC: instance reduction algorithm using hyperrectangle clustering. Pattern Recogn. 2015;48(5):1878–89.
Huang ZA, et al. A neural network learning algorithm for highly imbalanced data classification. Inf Sci. 2022;612:496–513.
Koziarski M. Radial-based undersampling for imbalanced data classification. Pattern Recogn. 2020;102:107262.
Sun B, et al. Radial-based undersampling approach with adaptive undersampling ratio determination. Neurocomputing. 2023;553:126544.
Mayabadi S, Saadatfar H. Two density-based sampling approaches for imbalanced and overlapping data. Knowl-Based Syst. 2022;241:108217.
Vuttipittayamongkol P, Elyan E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci. 2020;509:47–70.
Nwe MM, Lynn KT. KNN-based overlapping samples filter approach for classification of imbalanced data. In: Lee R, editor. Software Engineering Research, Management and Applications. Cham: Springer International Publishing; 2020. p. 55–73.
Zhai J, Qi J, Shen C. Binary imbalanced data classification based on diversity oversampling by generative models. Inf Sci. 2022;585:313–43.
Chen H, Weiqi L, Jane W. A low complexity quantum principal component analysis algorithm. arXiv; 2021.
Pan SJ, Wan LC, Liu HL, Wang YS, Qin SJ, Wen QY, Gao F. Quantum algorithm for neighborhood preserving embedding. arXiv; 2021.
Acknowledgements
Not applicable.
Funding
The authors declare that no funding was received for this research.
Author information
Contributions
TF implemented the approach and did the analyses and also prepared the initial draft. MHM proposed the original idea and supervised the process and proofread the manuscript. HT is the advisor and helped completion of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Feizi, T., Moattar, M.H. & Tabatabaee, H. A multimanifold learning based instance weighting and undersampling for imbalanced data classification problems. J Big Data 10, 153 (2023). https://doi.org/10.1186/s40537-023-00832-2