A multi-manifold learning based instance weighting and under-sampling for imbalanced data classification problems

Feizi, Tayyebe; Moattar, Mohammad Hossein; Tabatabaee, Hamid

doi:10.1186/s40537-023-00832-2

Research
Open access
Published: 06 October 2023

A multi-manifold learning based instance weighting and under-sampling for imbalanced data classification problems

Tayyebe Feizi¹,
Mohammad Hossein Moattar¹ &
Hamid Tabatabaee¹

Journal of Big Data volume 10, Article number: 153 (2023) Cite this article

981 Accesses
Metrics details

Abstract

Under-sampling is a technique to overcome imbalanced class problem, however, selecting the instances to be dropped and measuring their informativeness is an important concern. This paper tries to bring up a new point of view in this regard and exploit the structure of data to decide on the importance of the data points. For this purpose, a multi-manifold learning approach is proposed. Manifolds represent the underlying structures of data and can help extract the latent space for data distribution. However, there is no evidence that we can rely on a single manifold to extract the local neighborhood of the dataset. Therefore, this paper proposes an ensemble of manifold learning approaches and evaluates each manifold based on an information loss-based heuristic. Having computed the optimality score of each manifold, the centrality and marginality degrees of samples are computed on the manifolds and weighted by the corresponding score. A gradual elimination approach is proposed, which tries to balance the classes while avoiding a drop in the F measure on the validation dataset. The proposed method is evaluated on 22 imbalanced datasets from the KEEL and UCI repositories with different classification measures. The results of the experiments demonstrate that the proposed approach is more effective than other similar approaches and is far better than the previous approaches, especially when the imbalance ratio is very high.

Introduction

Imbalanced learning is one of the main challenges of classification in real-world problems. This challenge occurs when the number of examples from one class (called the majority class) is greater than the number of samples from the other class (called the minority class). Imbalance problem may be inevitable and happens when it is difficult to collect minority class samples, and the majority class samples are more abundant [1, 2]. Classification problems such as fraud detection [3], image segmentation [4, 5], intrusion detection [6, 7], disease detection [8,9,10], etc. are mostly imbalanced. Dealing with this issue is challenging because traditional classification approaches have presuppositions. Their default is that the training samples are equally distributed among the classes. Therefore, the majority class prevails over the minority class, and the minority examples are ignored. The inherent characteristics of imbalanced problems, such as overlapping and inseparability of classes [11, 12], increase the complexity of such data and weaken the classification performance [1, 2, 13].

Big data is the term used to describe datasets that are very huge, complicated, and contain a tremendous amount of information. In this kind of enormous dataset, the minority class may nevertheless be represented by a sizable number of examples. Because there are a lot of minority class samples even if they only make up a small fraction of the total dataset, handling imbalance becomes more difficult. In large data analytics, dealing with class imbalance becomes essential because disregarding it may result in biased model training. Large dataset processing and machine learning model training can be time- and resource-intensive. By balancing the classes, the training process can be improved, becoming more efficient and controllable.

Many studies have been done on the imbalanced data problem, and various techniques have been proposed. These techniques are divided into four main categories, which are algorithm-driven, cost-sensitive approaches, data-driven, and ensemble approaches. Algorithm-based methods try to adapt classifiers to imbalanced problems. These approaches try to modify the learning stage and accept the issue of imbalance in the data. In cost-sensitive methods, higher penalties are imposed for misclassifying minority samples, and these methods try to minimize the final penalty [14, 15].

Unlike other techniques, data-driven methods do not depend on classification and operate completely independently. These methods are usually done in the preprocessing stage. These methods benefit from two under- or over-sampling techniques or both and try to create a relative balance on imbalanced data. It seems that under-sampling techniques are more popular than over-sampling techniques because over-sampling techniques cause over-fitting. In ensemble approaches, several classifications are used simultaneously, and learning is done with the help of a voting technique or by combining the scores of the classifications. There is still a fundamental challenge with this type of approaches. The challenge here is how to combine the optimal classifications to increases the learning time [1, 2, 13].

Due to justifications such as applicability, generalizability and classifier independence, in this paper, an under-sampling approach is proposed. Two basic problems of under-sampling techniques can be pointed out, which are still a challenge among researchers. The main problems are: how many and which samples should be removed from the majority class? In this research, it is tried to overcome these two problems by introducing a new under-sampling method. The proposed method is based on the hypothesis that manifolds are the structures that can reflect the density and neighborhood properties of data. But since we are not sure that which manifold best suits a specific problem and dataset, a multi-manifold learning is proposed in this paper, which assesses the optimality of manifolds based on a proposed information loss-based heuristic.

The optimality indexes are used in a weighted combination of centrality and marginality criteria for the samples. The proposed approach is supposed to assign weights that determine the degree of importance of the samples from the majority class. A sequence of weights is created according to the relative importance of the samples. Then, the most insignificant samples are gradually removed from the majority class. Finally, combining the most important data of the majority class with the samples of the minority class, the training dataset is created. The proposed method is evaluated on 22 imbalanced datasets from KEEL and UCI repositories. The main contributions of this research are summarized as follows:

The samples of the majority class are weighted according to the multi-manifold approach, based on a weighted combination of the centrality and marginality of the samples on each manifold.
The weights are sorted in descending order. Less important samples are gradually removed from the majority class. The remaining samples can largely represent the distribution of the data.

The proposed approach reduces the overlap of minority and majority classes and increases class separability, which causes better classification performance. A simplified graphical abstract of the proposed method is shown in Fig. 1.

The rest of the paper is organized as follows: The next section gives a brief overview of under-sampling methods and related works. "Definitions and background" section introduces some definitions and the required background for better understanding the proposed approach. "Proposed method" section explains the proposed method in detail. The experimental results and discussions are in "Experimental Results and discussion" section. Finally, in the last section, the conclusions and the future research directions are discussed.

Related works

Under-sampling and over-sampling methods are performed in the preprocessing stage. In oversampling methods, minority samples increase. In these methods, new samples are usually created around the minority samples, or the minority samples are repeated again. Increasing the sample can cause an overfitting problem [16].

Researchers introduced ensemble approaches based on bagging, such as the Over-bagging method [17]. Various versions of the SMOTE technique, which is an oversampling method, are presented [18]. The SMOTE technique synthesizes a number of new samples with the help of k nearest neighbors by randomly choosing a neighbor among the minority samples for each sample. Along with its advantages, the SMOTE algorithm will include problems such as overgeneralization and variation in convergence [19]. Researchers presented the SMOTE-Bagging [17] and SMOTE-Boost [20] methods, which are a combination of SMOTE, bagging, and boosting. These methods have a high level of computational complexity, although they are significant in terms of performance. In addition, these methods cause overfitting and have many parameters to adjust [21, 22].

There are numerous under-sampling methods in the literature. Random under-sampling deals with the random removal of majority class samples. This may remove useful instances. This method is combined with ensemble approaches [23]. The under-bagging method is a combination of the ensemble method based on bagging and the random under-sampling method [24]. Seiffert et al. presented the RUS-Boost method, which is a combination of random under-sampling and boosting approaches [25]. Researchers proposed de-under-sampling methods called Near Miss, which consider the elimination of majority samples according to their distance from minority samples [26].

Under-sampling methods can be divided into two main groups, including methods based on KNN (k nearest neighbors) [27,28,29] and methods based on k-means [30,31,32]. Some under-sampling methods eliminate the majority of samples based on the information obtained from the nearest neighbors of the samples. The purpose of these methods is to remove samples that are located in marginal areas or are noisy and redundant. Kubat and Matwin [27] presented an under-sampling method called the One Side Selection method (OSS), which is one of the applications of Tomek links [33]. The samples in Tomek's links are considered marginal or noise samples. In the condensed nearest neighbor (CNN) method, if the label of a sample is the same as the label of its nearest neighbor (1NN), this sample is considered redundant [34]. In the OSS method, a large number of majority samples that are borderline, noisy, or redundant are removed. Removing a large number of samples reduces the performance of the classifier.

Laurikkala proposed the Neighborhood Cleaning Rule (NCL) to remove the samples of the majority class [28]. This method uses the Edited Nearest Neighbor (ENN) method [35] to eliminate the samples. In ref. [28], samples whose marginal score is more than two are eliminated. Also, samples are removed if one of the three nearest neighbors is from the minority class. In ref. [13], an under-sampling method is proposed that uses the density of the data to progressively remove data points from the majority class. Two factors are proposed to measure the degree of importance degree of each instance. Furthermore, the optimal under-sampling level is determined progressively.

In addition to eliminating the majority samples, Kang and his associates [29] also eliminated the noise in the minority class. They separated the minority samples into three groups: noisy, informative, and relatively informative. As a result, the classifier's performance will improve after the noisy minority samples have been eliminated. The exclusion of minority samples from the algorithm could lead to the failure of the classifier, which makes figuring out the value of the parameter k crucial.

Yang et al. proposed an under-sampling method that uses the natural neighborhood graph (NaNG). With the help of this graph, they are able to classify the training samples into central, marginal, and noise samples. They are able to under-sample by removing noisy and redundant samples. They called their sample reduction method NNGIR. One of the strengths of their methods is that they are non-parametric, increasing the reduction rate and improving prediction accuracy. The disadvantages of their method are the dependence on parameters and relatively low accuracy [36]. Hamidzadeh et al. [37] presented the LMIRA under-sampling method. They removed the non-marginal samples and kept the marginal samples. They considered their method a constrained binary optimization problem and used the filled function algorithm to solve it.

Pang and his colleagues [38] introduced a new secure under-sampling method called SIR-KMTSVM. In ref. [38], most redundant samples are removed from both the majority and minority classes. One of the advantages of their method is its use for large-scale problems. The disadvantages include high computational complexity and the removal of informative samples. Hamidzadeh and colleagues [39] proposed an under-sampling method that solves the instance reduction problem as an unconstrained multi-objective optimization problem. They designed a weighted optimizer and searched for the appropriate samples with the help of chaotic krill-herd evolutionary algorithm. The advantage of the method is the improvement in accuracy, geometric mean, and calculation time. The main weakness of the method is that it can only be applied to normal-sized datasets.

Another common under-sampling technique is clustering, which helps to have a logical training set [40]. Researchers named Chen and Shiu [30] put the majority of samples in k different clusters and used the k-means clustering method. Then, by combining each cluster from the majority class with the minority samples, they created new data groups that are more balanced. Each data group is trained separately and builds a classification model. Finally, all models are aggregated together to predict new samples. The weakness of this algorithm is that it has not determined how the value of the parameter k is determined.

Yen and Lee [32] used clustering to propose their under-sampling method. They identified representatives of the majority class to create new training data. They first divided the entire training data into k clusters. They performed clustering based on a ratio of majority samples to minority data. The weakness of the algorithm is that it does not specify how to value the parameters k and m. Lin and his colleagues [31] presented a new under-sampling method that clusters the majority samples with the k-means method, where the value of k is equal to the number of minority samples.

In addition to the mentioned methods, some researchers [41,42,43] used k-means clustering before applying the under-sampling method to determine the type of the majority sample in terms of noise, redundancy, or marginality. Hamidzadeh and his colleagues [44] introduced an under-sampling method based on hyper-rectangle clustering and called their method IRAHC. They removed the central samples and kept the marginal and near-marginal samples.

Huang and colleagues [45] introduced a neural network algorithm (NN_HIDC) for the classification problem of highly imbalanced data. They proposed the generalized gradient descent algorithm. This algorithm is used in re-sampling and re-weighting methods in neural network. They extended the locally controllable bound to reduce the insufficient empirical representation of the positive class. The advantages of this algorithm can be mentioned in its use for any very imbalanced data, and the weaknesses of the algorithm are that the extended gradient of the positive class can only reach the local border and gradient measurement is required for all samples in each iteration. Koziarski [46] introduced a radial under-sampling (RBU) method in the classification problem of imbalanced data. RBU uses the concept of mutual class potential in the under-sampling method. The advantages of the method include reducing the time complexity compared to RBO, being effective on the difficult minority class that includes small disjunct, outliers and small number of samples, overcoming the limitations of neighborhood-based methods.

Sun et al. [47] introduced a radial under-sampling approach with adaptive under-sampling ratio determination. They called their algorithm RBU-AR. This method determines the appropriate under-sampling ratio according to the complexity of the class overlap and does not use the default value of one or trial and error. The advantages of their approach are better performance in high overlap. The weakness of their approach is the lack of application in multi-class problems. Mayabadi and colleagues [48] proposed two density-based algorithms to remove overlap, noise and balance between classes. The first algorithm uses under-sampling technique and the second algorithm uses under-sampling and over-sampling simultaneously. Their method removes high density samples from the majority class. The advantages of their algorithms are maintaining the class structure as much as possible and improving performance.

Vuttipittayamongkol and Elyan [49] proposed an under-sampling method to solve binary data imbalance problem by removing overlapping data. They focused on the detection and elimination of overlapping majority samples. The advantages of their algorithm are preventing information loss and improving the sensitivity criterion. The weak points of the algorithm are how to set the value of k in the k-NN law and the failure to examine multi-class problems. Nwe and colleagues [50] introduced an effective under-sampling method using k-nearest-neighbor-based overlapping samples filter to classify imbalanced and overlapping data. The advantage of their algorithm is to prevent information loss. The weak points of their algorithm are setting the value of k, not checking high dimensions and not checking multi-class problems.

Zhai and colleagues [51] proposed two diversity over-sampling methods, BIDC1 and BIDC2, which were based on generative models. BIDC1 and BIDC2 methods use extreme machine autoencoder and generative adversarial network, respectively. Among the advantages of their methods, we can mention the simple but effective idea, improving performance in data with low and high imbalance ratio, suitable for different practical scenarios, creating variety in over-sampling and preventing overlap of classes. The weaknesses of their method are the lack of scalability in big data and the difference between the original and generated data distribution. Table 1 summarizes of the strengths and weaknesses of the some of the main approaches in this regard.

Table 1 Summary of the strengths and weaknesses of the most important reviewed articles

A multi-manifold learning based instance weighting and under-sampling for imbalanced data classification problems

Abstract

Introduction

Related works

Definitions and background

Definitions

Manifold learning

Principal component analysis (PCA)

Neighborhood preserving embedding (NPE)

Locality preserving projection (LPP)

Proposed method

The logic of the proposed approach

Multi-manifold learning approach

Manifolds optimality determination

Weighted combination of centrality and marginality in the multi-manifold approach

Gradual under-sampling of data

Experimental results and discussion

Datasets

Experimental setup

Evaluation criteria

Simulation results

Multi-manifold approach with reduction step of 5 percent

Multi-manifold approach with reduction step of 10 percent

Comparison of single-manifold and multi-manifold approaches

Comparison with other under-sampling approaches

Comparison with state-of-the-art under/over sampling approaches

Statistical analysis by Wilcoxon test

Evaluation and discussions on kddcup network intrusion detection dataset

Evaluations on artificial datasets

Discussion on marginality and centrality criteria

Computational complexity analysis

Conclusions and future work

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords