Graph-Based Automatic Feature Selection for Multi-Class Classification via Mean Simplified Silhouette

This paper introduces a novel graph-based filter method for automatic feature selection (abbreviated as GB-AFS) for multi-class classification tasks. The method determines the minimum combination of features required to sustain prediction performance while maintaining complementary discriminating abilities between different classes. It does not require any user-defined parameters such as the number of features to select. The methodology employs the Jeffries-Matusita (JM) distance in conjunction with t-distributed Stochastic Neighbor Embedding (t-SNE) to generate a low-dimensional space reflecting how effectively each feature can differentiate between each pair of classes. The minimum number of features is selected using our newly developed Mean Simplified Silhouette (abbreviated as MSS) index, designed to evaluate the clustering results for the feature selection task. Experimental results on public data sets demonstrate the superior performance of the proposed GB-AFS over other filter-based techniques and automatic feature selection approaches. Moreover, the proposed algorithm maintained the accuracy achieved when utilizing all features, while using only $7\%$ to $30\%$ of the features. Consequently, this resulted in a reduction of the time needed for classifications, from $15\%$ to $70\%$.


Introduction
Feature selection is a crucial step in the process of developing effective machine-learning models. Selecting the most relevant features from a dataset helps to reduce model complexity, prevent overfitting, and improve model interpretability and performance [1]. In recent years, with the explosion of big data, feature selection has become an increasingly important technique in machine learning, as it can significantly reduce the time and resources required for model development while maintaining prediction accuracy [2].
The rise of big data presents distinctive challenges and opportunities. In today's digital era, the pace of data generation has accelerated, particularly in fields such as genomics, where both the complexity and volume of data have significantly increased [3]. Effectively managing such extensive and complex data requires preprocessing methods. Feature selection emerges as a pivotal technique in this context, not only simplifying the model by reducing the number of parameters but also enhancing the performance and accuracy of data classification [4]. It plays an instrumental role in addressing the curse of dimensionality, which is a significant challenge in big data environments, by selecting features that are most relevant and non-redundant, thereby preserving the essence of the original data while discarding the redundant [5].
The main goal of feature selection is to find the optimal k-sized subset of features that accurately represents the input data [6]. This technique aims to reduce the impact of irrelevant variables and noise while maintaining prediction accuracy [7]. There are three main types of feature selection methods: wrapper, embedded, and filter. Each type employs distinct strategies for selecting features and comes with its own benefits and drawbacks [8].
Wrapper methods select feature subsets based on their performance in predictive models, often leading to highly accurate and model-specific feature sets [9]. Their major drawback is computational intensity, as they evaluate numerous feature combinations through repeated model training and validation, making them less feasible for large-scale applications and datasets, or for rapid prototyping [6].
Embedded methods integrate feature selection directly into the model training process, typically using regularization techniques to penalize less relevant features [10]. While this approach is efficient and can enhance model performance, it yields feature selections that are closely tied to the specific learning algorithm, which can limit their generalizability to other models or data scenarios and may lead to overfitting [11].
Filter methods rank features by their statistical characteristics or relevance to the target variable, offering a model-independent, faster, and more efficient approach, particularly for large datasets [12]. However, these methods come with certain limitations. Firstly, the selection process in filter methods, typically based on the highest score of a feature according to the method's index, may overlook the interrelationships between features [13]. This can result in selecting features with similar rather than complementary abilities. Secondly, most filter-based methods rely on user input to determine the size of the feature subset, presuming the user has prior knowledge of the data, which often leads to a trial-and-error approach. Thirdly, as these methods are independent of the learning model, the user needs to evaluate the quality of the selected subset of features, and therefore must perform multiple runs of different classifiers with different calibrations in order to choose a high-quality final solution.
This paper proposes a novel filter-based feature selection method for multi-class classification tasks that addresses these shortcomings. Our approach generates a new feature space, defined by how well each feature differentiates between class pairs. Instead of merely picking features with the highest separation scores, as is common in other filter-based approaches, our method aims to select a group of features with complementary discrimination capabilities. This is achieved using the K-medoids algorithm in a low-dimensional space, which preserves the inherent nonlinear structure of the original feature space, allowing the selection of features from various parts of the space. Additionally, in contrast to conventional filter methods that require the size of the feature subset as a predefined input, our approach autonomously discovers the minimal combination of features necessary for preserving the prediction performance obtained with the entire feature set, using a newly developed Mean Simplified Silhouette (MSS) index. The MSS index, which is based on the Simplified Silhouette (SS) index [14], evaluates clustering outcomes in the context of feature selection for classification problems. It assesses the effectiveness of the selected subset of features obtained from clustering results, aiming to select features that span the entire feature space, i.e., features with complementary separation capabilities. We demonstrate a strong correlation between the MSS index values and accuracy results across a variety of datasets that differ in size and characteristics, as well as across different classifiers. Leveraging this correlation, we can evaluate the quality of the selected subset of features with the MSS index only, without the need to run classifiers, thus saving significant run-time.
The major contributions of this work include:
• A graph-based feature selection method that identifies the minimum set of k features required to preserve the prediction accuracy obtained with the entire feature set.
• An agnostic methodology that is independent of specific statistical measures or dimensionality reduction techniques for assessing the features' ability to distinguish between classes.
• A novel Silhouette-based index for evaluating clustering outcomes in the context of feature selection for multi-class classification problems.
• An experimental analysis demonstrating the effectiveness and superior performance of our approach compared to state-of-the-art filtering methods.
The remainder of this paper is organized as follows. The next section gives a brief overview of related works. Section Definitions and background introduces some definitions and the required background for a better understanding of the proposed method and experiments. Section Proposed method outlines the proposed method in detail. The experimental results and discussions are in Section Experimental results. Finally, in Section Conclusion and future work, the conclusion and future research directions are discussed.

Related works
Recent advancements in filter-based feature selection have explored the use of graph-based techniques. Graph-based techniques have emerged as a promising field due to their advantages, such as improved interpretability, capturing complex relationships between features [15], and the potential to handle high-dimensional data more effectively [16]. These methods involve constructing a graph that captures pairwise relationships between features in the data, while also considering their relevance and redundancy. These methods rank the features based on specific criteria and select the k features with the highest score. Building on these developments, Briola et al. [17] introduced an innovative unsupervised, graph-based filter feature selection technique leveraging topologically constrained network representations. This approach incorporates a selection strategy that prioritizes the top k features, refining the process of feature selection. A main drawback of these methods, however, is that selecting features with the highest score may result in selecting features with identical characteristics that do not cover the entire feature space, leading to a loss of information.
Friedman et al. [18] proposed a potential solution to the aforementioned limitation through a filter-based feature selection technique that utilizes diffusion maps [19]. This method encompasses the entire feature space by constructing a new feature space based on the features' separation capabilities. It then selects features with complementary separation capabilities that cover the entire feature space. Similarly, Amin et al.'s Multilabel Graph-Based Feature Selection (MGFS) method [20], which addresses the same problem, employs the PageRank algorithm [21] for efficient feature selection in multi-label data. MGFS constructs a graph with features as nodes linked according to their similarity, which is measured using a correlation distance matrix. The significance of each feature is assessed using PageRank, simplifying the identification of key features in complex datasets. Likewise, Parlak et al. [22] proposed the Extensive Feature Selector (EFS) method, which utilizes class-based and corpus-based probabilities to select distinctive features for text classification. It incorporates clustering by calculating the corpus-based and class-based probabilities separately, aiming to choose more distinctive features. Collectively, these methods share the goal of selecting features with complementary capabilities to provide a comprehensive understanding of the feature space.
A major drawback of filter-based feature selection techniques is the necessity to determine the number of selected features k as an input parameter. Usually, this value is defined under the assumption that a minimal percentage of features encompasses all necessary information. This limitation presents a considerable challenge since the optimal k value can vary based on the data and the task, resulting in a trial-and-error approach that demands significant time and resources. Hence, there is an increasing demand for a filter-based feature selection algorithm that can automatically determine the minimal combination of features required to sustain prediction performance.
Roffo et al. [23] introduced Infinite Feature Selection (Inf-FS), which addresses this specific challenge. This method represents features as paths in a graph, with nodes denoting features and edges signifying their relevance and non-redundancy. It utilizes matrix power series and Markov chains for ranking, and a clustering algorithm subsequently selects the final feature set based on these rankings. A notable drawback of this method is the requirement to fine-tune the α parameter, which balances feature relevance and diversity. Different values of α affect the scoring of the features and the priority given to the selected features, meaning that different α values may lead to completely different final results. Addressing the same limitation, Thiago et al. [24] introduced a filter-based algorithm named Supervised Simplified Silhouette Filter ($S^3F$) that employs the Simplified Silhouette (SS) index [14,25]. This method requires the user to define a search range by setting both a minimum and a maximum value for k. The method then identifies the optimal k within this specified range. This approach exhibits two primary shortcomings. Firstly, if the user selects a minimum and maximum k that fail to encompass the optimal k value, the optimal value is not identified. Secondly, although the SS index is effective for assessing clustering quality, it is not designed to evaluate clustering quality for feature selection tasks in classification problems. Although the two methods mentioned above determine the feature subset size within their frameworks, their performance is substantially influenced by user-defined parameters, which have a considerable impact on the outcome.
In summary, the analysis of existing literature reveals a significant need for a filter-based feature selection method that not only focuses on choosing features with complementary discriminating abilities that cover the entire feature space but also determines the minimal size of the final feature subset. This should be achieved independently of user-defined parameters to facilitate full adaptability and automation of the feature selection process across a broader range of datasets.

Definitions and background
This section presents the definitions and background needed to follow the proposed method and the experiments.

Notation
Denote the learned dataset by $(X, Y)$, where $X$ is an $N \times M$ data matrix, with $N$ representing the number of samples and $M$ the number of features. The labels are stored in the $N \times 1$ vector $Y$, which takes values in $C$ classes. The dataset $X$ comprises $M$ feature vectors, represented by $F = \{f_1, \ldots, f_M\}$, where each $f_i$ is of size $1 \times N$.

Jeffries-Matusita distance
In the experiments, we employ the JM distance within the proposed method as a statistical measure of the similarity between two probability distributions [26,27]. The formulation follows the cited literature. Given a feature $f_i \in F$, we use the JM distance to construct a $C \times C$ matrix, $JM_i$, which defines how well the feature $f_i$ differentiates between all pairs of classes. Specifically, the matrix entry $JM_i(c, \tilde{c})$ indicates how well the feature $f_i$ differentiates between the two classes $c$ and $\tilde{c}$, where $1 \le c, \tilde{c} \le C$. The matrix entries are computed by:
$$JM_i(c, \tilde{c}) = 2\left(1 - e^{-B_i(c, \tilde{c})}\right)$$
where:
$$B_i(c, \tilde{c}) = \frac{1}{8}\left(\mu_{i,c} - \mu_{i,\tilde{c}}\right)^2 \frac{2}{\sigma_{i,c} + \sigma_{i,\tilde{c}}} + \frac{1}{2}\ln\left(\frac{\sigma_{i,c} + \sigma_{i,\tilde{c}}}{2\sqrt{\sigma_{i,c}\,\sigma_{i,\tilde{c}}}}\right)$$
is the Bhattacharyya distance. The values $\mu_{i,c}, \mu_{i,\tilde{c}}$ and $\sigma_{i,c}, \sigma_{i,\tilde{c}}$ are the means and variances of the feature $f_i$ within the two classes $c$ and $\tilde{c}$.
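The per-feature JM matrix can be illustrated with a minimal Python sketch. It assumes Gaussian class-conditional distributions and the common form JM = 2(1 − e^{−B}); the function names `bhattacharyya` and `jm_matrix`, and the small variance floor, are illustrative choices rather than part of the method's specification.

```python
import math

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians (var = variance)."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term

def jm_matrix(values, labels):
    """C x C matrix of JM distances between all class pairs for one feature."""
    classes = sorted(set(labels))
    stats = {}
    for c in classes:
        xs = [v for v, y in zip(values, labels) if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        # Assumption: a tiny floor keeps constant features from dividing by zero.
        stats[c] = (mu, var + 1e-12)
    return [
        [2.0 * (1.0 - math.exp(-bhattacharyya(*stats[a], *stats[b]))) for b in classes]
        for a in classes
    ]

# Toy feature that separates two classes almost perfectly.
toy = jm_matrix([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], [0, 0, 0, 1, 1, 1])
```

Under this form the entries lie in [0, 2]: the diagonal is zero, and well-separated class pairs approach 2.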

t-distributed Stochastic Neighbor Embedding
In the experiments, we integrate t-SNE [28] into the proposed method as a nonlinear dimensionality reduction algorithm that maps high-dimensional data to a low-dimensional space while preserving local structure. For this paper to be self-contained, we present the t-SNE method and formulations according to the cited literature. The t-SNE algorithm is an improvement over the original SNE (Stochastic Neighbor Embedding) [29] algorithm, providing more accurate and interpretable visualizations by mitigating the crowding problem and simplifying the optimization process [30]. The t-SNE algorithm works by embedding points from a high-dimensional space $\mathbb{R}^M$ into a lower-dimensional space $\mathbb{R}^R$ ($R \ll M$), while preserving the pairwise similarities between the points. Given a dataset of $N$ points $\{u_1, u_2, \ldots, u_N\} \subset \mathbb{R}^M$ in the high-dimensional space, the t-SNE algorithm aims to find a corresponding set of points $\{v_1, v_2, \ldots, v_N\} \subset \mathbb{R}^R$ in the low-dimensional space that best reflects the similarities in the original space.
The algorithm defines pairwise conditional probabilities $p_{j|i}$ as the likelihood that point $u_j$ is $u_i$'s neighbor in the high-dimensional space. These probabilities are defined as:
$$p_{j|i} = \frac{\exp\left(-\lVert u_i - u_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\left(-\lVert u_i - u_k \rVert^2 / 2\sigma_i^2\right)}$$
where $\sigma_i$ is the variance of the Gaussian centered at point $u_i$. The value of $p_{j|i}$ is influenced by the distance between points $u_i$ and $u_j$, with closer points having higher probabilities. The algorithm defines a symmetric pairwise similarity $p_{ij}$, which measures the similarity between points $u_i$ and $u_j$ in the high-dimensional space, defined as the average of the conditional probabilities $p_{i|j}$ and $p_{j|i}$, normalized by the number of points $N$:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
The use of the symmetric pairwise similarity allows for a more balanced representation of similarities between points, mitigating the effects of differences in local densities. In the low-dimensional space, pairwise similarities between points $v_i$ and $v_j$ are defined as $q_{ij}$:
$$q_{ij} = \frac{\left(1 + \lVert v_i - v_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert v_k - v_l \rVert^2\right)^{-1}}$$
The t-SNE algorithm seeks to minimize the divergence between the distributions $P$ and $Q$, which is measured by the Kullback-Leibler (KL) divergence:
$$KL(P \,\Vert\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Minimizing the KL divergence ensures that the low-dimensional embedding preserves the pairwise similarities between points as accurately as possible.
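To make the low-dimensional similarities and the objective concrete, the following Python sketch computes the Student-t similarities $q_{ij}$ for a set of embedded points and the KL divergence between two similarity matrices. It assumes the standard t-SNE kernel with one degree of freedom; the function names and the `eps` guard are illustrative, and the sketch does not perform the gradient-based optimization itself.

```python
import math

def student_t_similarities(points):
    """q_ij for low-dimensional points using the heavy-tailed kernel 1 / (1 + d^2)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, w))  # normalize over all ordered pairs (k != l)
    return [[x / total for x in row] for row in w]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over off-diagonal pairs; eps guards against log(0)."""
    n = len(p)
    return sum(
        p[i][j] * math.log((p[i][j] + eps) / (q[i][j] + eps))
        for i in range(n) for j in range(n)
        if i != j and p[i][j] > 0
    )

q = student_t_similarities([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
```

Closer embedded points receive larger $q_{ij}$, and the divergence is zero when the two similarity matrices coincide.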

Silhouette
In the proposed method, we develop a Silhouette-based index for evaluating the quality of clustering in the context of feature selection. The classical Silhouette index [31] is a metric used to evaluate clustering quality by measuring how similar a data point is to its own cluster compared to other clusters. It has a value between −1 and 1, indicating the level of separation between the clusters and the level of cohesion within each cluster. Specifically, the index calculates, for each point $i$, the average distance of the point from all other points in the same cluster, $a(i)$, and the average distance of the point from all points in the closest neighboring cluster, $b(i)$. Thus, the Silhouette value for point $i$ is computed as follows:
$$sil(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where −1 indicates a data point closer to the neighboring cluster, 0 indicates a boundary point, and 1 indicates a data point that is much closer to the other points in the same cluster than to the points of the closest cluster. The Silhouette value of a full clustering is the average value of $sil(i)$ across all data points. The Silhouette index, being computationally expensive and sensitive to outliers, prompted the development of the Simplified Silhouette (SS) index [14,25], a faster and more robust alternative. The SS index for a point $i$ is computed as follows:
$$ss(i) = \frac{b(i)' - a(i)'}{\max\{a(i)', b(i)'\}}$$
where $a(i)'$ is the distance of point $i$ from the centroid of its own cluster and $b(i)'$ is the distance of point $i$ from the centroid of the nearest neighboring cluster (in this work, centroids are replaced by medoids). The $ss(i)$ value ranges from −1 to 1. Because at the end of K-means or K-medoids clustering, the distance of a data point to its closest neighboring cluster's centroid or medoid, $b(i)'$, is always greater than or equal to the distance to its own cluster's centroid or medoid, $a(i)'$, the term $\max\{a(i)', b(i)'\}$ can be simplified to $b(i)'$ [25]. Therefore, after executing the K-means or K-medoids algorithms, the SS value for a single point can also be simplified as follows:
$$ss(i) = 1 - \frac{a(i)'}{b(i)'}$$
Similarly to the Silhouette index, the SS index is the average of the SS values over all data points.
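The simplified form of the SS index can be sketched in a few lines of Python. The sketch assumes Euclidean distance, at least two medoids, and that each point belongs to the cluster of its nearest medoid (which holds after K-medoids converges); the function name is illustrative, and the singleton-cluster convention of the original indices is not modeled.

```python
import math

def simplified_silhouette(points, medoids):
    """Mean of ss(i) = 1 - a'(i) / b'(i) over all points."""
    scores = []
    for p in points:
        dists = sorted(math.dist(p, m) for m in medoids)
        a, b = dists[0], dists[1]  # own medoid, nearest other medoid
        scores.append(0.0 if b == 0 else 1.0 - a / b)
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ss = simplified_silhouette(points, [(0.0, 0.0), (10.0, 0.0)])
```

Only distances to the k medoids are needed for each point, which is what makes the index cheaper than the classical Silhouette.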

Kneedle algorithm
In the proposed method, we employ the Kneedle algorithm as a selection tool to identify the minimal subset of $k_{\min}$ features that can effectively classify different classes without experiencing a decline in performance. The Kneedle algorithm [32] is used to identify the points of maximum curvature in a given discrete dataset, commonly referred to as "knees". These knees are generally the set of points on a curve that represent local maxima if the curve is rotated by an angle of θ degrees clockwise about the point $(x_{\min}, y_{\min})$ through the line that connects the points $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$. The identified points are those that differ most from the straight-line segment connecting the first and last data points, representing the points of maximum curvature for a discrete set of points.
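The core idea, finding the point that deviates most from the chord between the first and last data points, can be illustrated with a simplified Python sketch. It assumes a concave, increasing curve with sorted x values and omits the smoothing and threshold-based detection of the full Kneedle algorithm [32]; `knee_point` is a hypothetical helper name.

```python
def knee_point(xs, ys):
    """Return the x whose normalized point lies farthest above the chord
    joining the first and last points (concave, increasing curve assumed)."""
    x0, x1 = xs[0], xs[-1]
    y0, y1 = min(ys), max(ys)
    best_x, best_gap = xs[0], float("-inf")
    for x, y in zip(xs, ys):
        xn = (x - x0) / (x1 - x0)  # normalize both axes to [0, 1]
        yn = (y - y0) / (y1 - y0)
        gap = yn - xn              # vertical distance above the diagonal y = x
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x

ks = list(range(2, 12))
scores = [0.50, 0.70, 0.82, 0.88, 0.91, 0.93, 0.94, 0.95, 0.955, 0.96]
knee = knee_point(ks, scores)  # 5 for this toy curve
```

On the toy curve the gains flatten after k = 5, which is the point the sketch returns.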

Proposed method
We now present our graph-based filter method for automatic feature selection (GB-AFS). As explained above, the method determines the optimal subset of features that best represent the data, achieving an effective balance between model performance and computational efficiency. The overall architecture of our method is presented in Fig. 1, while Algorithm 1 outlines the specific steps for implementing the method. Sections Separability-based feature space, Clustering evaluation for feature selection using MSS, and Optimal k determination present the three stages of the proposed GB-AFS method.

Separability-based feature space
Our aim is to preprocess the input data and move them into a reduced feature space that retains the original feature space's ability to distinguish between each pair of classes. To this end, we first create a new feature space $Z$, defined by the separability capability of each feature with respect to every pair of classes. For each feature $f_i \in F$, we compute a matrix $T_i$ of size $C \times C$ that captures the separation capabilities of the feature with respect to all possible pairs of classes. This computation uses a statistical measure to assess the distance between two probability distributions. Specifically, the matrix entry $T_i(c, \tilde{c})$ indicates how well the feature $f_i$ differentiates between the two classes $c$ and $\tilde{c}$, where $1 \le c, \tilde{c} \le C$. The matrices $T_1, \ldots, T_M$ are combined to form the new feature space $Z$. Subsequently, to visualize and organize the separability characteristics of the features, we employ a nonlinear dimensionality reduction technique to obtain the new feature space $Q$. Fig. 2 shows an example of the construction of a new feature space from the Microsoft Malware Prediction dataset [33], using the Jeffries-Matusita (JM) distance as the statistical measure and t-distributed Stochastic Neighbor Embedding (t-SNE) as the nonlinear dimensionality reduction technique. More details about the JM distance and the t-SNE technique are given in Section Jeffries-Matusita distance and Section t-distributed Stochastic Neighbor Embedding, respectively. The dataset comprises 257 features and 9 classes, and is therefore represented by 257 matrices, each of size 9 × 9. A matrix is visualized in Fig. 2 as a 9 × 9 grid, where each entry represents the degree of separation between two distinct classes. Pronounced separation is depicted in red, gradually transitioning to yellow as the separation narrows.
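One possible way to turn the per-feature matrices into points of the feature space is sketched below in Python. It assumes each matrix is symmetric and is vectorized by its upper triangle, giving one coordinate per class pair; this vectorization and the helper names are assumptions made for illustration, not a statement of the exact construction used.

```python
def flatten_pairwise(matrix):
    """Upper-triangular entries (c < c') of a symmetric C x C matrix, row by row."""
    c = len(matrix)
    return [matrix[a][b] for a in range(c) for b in range(a + 1, c)]

def build_feature_space(matrices):
    """One row per feature; each row holds C * (C - 1) / 2 class-pair scores."""
    return [flatten_pairwise(t) for t in matrices]

# Two toy features over three classes.
t1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
t2 = [[0.0, 0.1, 1.9], [0.1, 0.0, 1.8], [1.9, 1.8, 0.0]]
z = build_feature_space([t1, t2])  # [[1.0, 2.0, 0.5], [0.1, 1.9, 1.8]]
```

Under this assumption, a dataset with 9 classes would give each feature a 36-dimensional separability vector, which the dimensionality reduction step then embeds in a low-dimensional space.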

Clustering evaluation for feature selection using MSS
The GB-AFS method aims to identify the minimal subset of $k_{\min}$ features that retains the ability of the entire set of $M$ features to separate and distinguish different classes. To identify this subset, we follow a two-step procedure for every $k \in [2, M]$ to obtain a score reflecting the capability of a k-sized feature subset to represent the entire feature space's ability to separate and distinguish between different classes.
In the first step, features are selected using the K-medoids [34] algorithm, which selects features from different regions in the low-dimensional space, with complementary separation capabilities. The K-medoids algorithm, however, has a significant drawback in that it is sensitive to the initialization of centers. To mitigate this issue, we utilize the K-means++ [35] initialization algorithm, which initializes the algorithm more effectively by selecting the initial centers using a probability distribution based on the distances between data points. Since K-medoids is designed to minimize the sum of distances between features and their nearest medoids, the objective of the second step is to evaluate how effectively the k-sized subset of features obtained from the K-medoids algorithm represents the entire feature space, by also considering the separation between clusters. For this goal, we propose a new metric, named the Mean Simplified Silhouette (MSS) index, which is a variation of the SS index that evaluates the clustering outcome in the context of feature selection in classification problems, as explained in the next paragraph. By performing multiple runs of K-medoids for all values of $k$ in the range $[2, M]$, we can identify the minimal subset of $k_{\min}$ features that most effectively represents the original feature set and maintains the classification performance obtained with the entire set of features.
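The first step can be sketched in Python as follows. This is a minimal illustration, assuming Euclidean distance in the low-dimensional space and a simple alternating (assign, then update) variant of K-medoids with K-means++ style seeding; the function names, the `seed` parameter, and the iteration cap are illustrative assumptions.

```python
import math
import random

def kmeanspp_seeds(points, k, rng):
    """Pick k initial medoid indices, sampling proportionally to squared distance."""
    seeds = [rng.randrange(len(points))]
    while len(seeds) < k:
        d2 = [min(math.dist(p, points[s]) ** 2 for s in seeds) for p in points]
        total = sum(d2)
        if total == 0:  # every remaining point coincides with a seed
            seeds.append(next(i for i in range(len(points)) if i not in seeds))
            continue
        r, acc = rng.random() * total, 0.0
        for idx, w in enumerate(d2):
            acc += w
            if w > 0 and acc >= r:
                seeds.append(idx)
                break
    return seeds

def k_medoids(points, k, seed=0, max_iter=100):
    """Return the sorted indices of the medoids (the selected features)."""
    rng = random.Random(seed)
    medoids = kmeanspp_seeds(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: each new medoid minimizes the total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

embedded = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
selected = k_medoids(embedded, 2)
```

Because medoids are actual points, the returned indices directly identify which features are kept for a given k.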

Mean Simplified Silhouette
The goal of feature selection is to identify a subset of features that cover the entire feature space with complementary capabilities, while avoiding the selection of redundant features with the same separation capabilities as the chosen ones. Existing clustering evaluation indices, such as Silhouette and Simplified Silhouette, as explained in Section Silhouette, typically evaluate how closely a point is associated with its own cluster or its cluster's medoid, and how distinct it is from the closest cluster to which it does not belong. Furthermore, those indices usually set the value of a point to zero [31,36] if it happens to be the sole point present in a cluster, which causes the indices to tend to zero as the number of clusters approaches the number of features in the space. However, when considering clustering outputs in the context of the feature selection task, our goal is to create an index that effectively measures the separation between a given point and all clusters to which it does not belong. This index aims to quantify the extent to which the chosen points encompass the entire feature space comprehensively. Moreover, it considers the proximity of each point within a cluster to its designated representative point, assessing the representative point's capability to capture the characteristics of the other points it represents. Furthermore, in the case where each point constitutes an individual cluster, our metric should reach its highest possible value, indicating complete coverage of the feature space. This approach aims to enhance the feature selection process by ensuring both comprehensive coverage and accurate representation within the feature space.
Our MSS index effectively addresses the limitations associated with both the Silhouette and SS indices. The proposed MSS calculates the distances from each point to all other cluster medoids within the feature space, excluding the medoid of the cluster to which the feature belongs. This modification enables a more reliable assessment of the selected features and ensures that they are sufficiently diverse and complementary to provide full coverage of the entire feature space. Additionally, unlike the other clustering evaluation indices, we exclude clusters that have only one feature from the MSS calculation. This exclusion is based on the rationale that isolating a feature which is significantly distant from its cluster's center to form a new, single-feature cluster should enhance the clustering outcome and increase the MSS index. Through these modifications, the MSS index becomes an appropriate index for assessing the performance of clustering algorithms within the context of feature selection. It favors an arrangement of the feature space that facilitates the identification of a group of features with complementary characteristics.
The MSS index is calculated based on the distances between each feature and the medoids of each cluster, similar to the SS index. It differs from the Silhouette index, which calculates how close each feature in a cluster is to features in its own cluster compared to features in other clusters. We believe that our approach is preferable because it is more reasonable to calculate distances from the medoids, which are the representative features, rather than from the excluded features. In addition, MSS reduces the computational complexity: when computing distances from the medoids, the computational complexity is estimated as $O(kRM)$, as opposed to $O(RM^2)$ when computing distances from all features within each cluster. This difference is significant when $k$ is much smaller than the number of features in the feature space, $M$.
To compute the MSS index, we begin by defining the values $a(i)$ and $b(i)$ for every point $i$ in the dataset. The value of $a(i)$ corresponds to the distance $d(\cdot, \cdot)$ between point $i$ and the center of the cluster to which it belongs, $C_h$, whereas the value of $b(i)$ denotes the average distance of point $i$ from the centers of all other clusters $C_l$, $l \ne h$; namely:
$$a(i) = d(i, C_h), \qquad b(i) = \frac{1}{k-1}\sum_{l \ne h} d(i, C_l), \qquad MSS(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where the MSS index is the average of the MSS coefficients over all data points.
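The definitions of $a(i)$ and $b(i)$ translate into the following Python sketch. It assumes Euclidean distance, at least two medoids, a per-point coefficient of the Silhouette form $(b - a)/\max\{a, b\}$, and that points in single-member clusters are skipped; returning 1.0 when every cluster is a singleton is likewise an assumption drawn from the description above, and `mss_index` is an illustrative name.

```python
import math

def mss_index(points, medoid_ids):
    """Mean Simplified Silhouette sketch over points in multi-member clusters."""
    medoids = [points[m] for m in medoid_ids]
    # Each point belongs to the cluster of its nearest medoid.
    owner = [min(range(len(medoids)), key=lambda h: math.dist(p, medoids[h])) for p in points]
    sizes = [owner.count(h) for h in range(len(medoids))]
    scores = []
    for p, h in zip(points, owner):
        if sizes[h] == 1:
            continue  # assumption: singleton clusters are left out of the average
        a = math.dist(p, medoids[h])                       # distance to own medoid
        others = [math.dist(p, m) for j, m in enumerate(medoids) if j != h]
        b = sum(others) / len(others)                      # mean distance to all other medoids
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    # Assumption: full coverage (all singletons) maps to the maximum value.
    return sum(scores) / len(scores) if scores else 1.0

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
score = mss_index(points, [0, 2])
```

With only two clusters, $b(i)$ reduces to the distance to the single other medoid, so the sketch coincides with the SS index in that case; the two indices differ once k exceeds 2.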
Figure 3 presents the Silhouette, SS, and MSS values depicted as solid lines and the accuracy obtained by three classifiers depicted as dashed lines over different values of k. These results were obtained by executing the first two stages of GB-AFS (Fig. 1) on the Microsoft Malware Prediction dataset [33]. It can be observed that as the value of k increases, both the Silhouette and SS values decrease rapidly toward zero, indicating an increase in single-point clusters. Conversely, the adjustments made to the MSS index enable the assessment of how effectively the clustering algorithm utilizes the feature space for the feature selection problem. Moreover, a clear correlation was observed between the MSS index and the accuracy results of the classifiers, which emphasizes the effectiveness of the MSS index in solving feature selection tasks in classification problems.
It is important to note that our objective is not solely focused on pinpointing the k value that results in the highest accuracy. Our objective is also to find the smallest k value that is sufficient for obtaining acceptable accuracy. Even if a higher k value may lead to marginally better accuracy, it may demand excessive resources and computational power, which would not be practical or worthwhile. That is why we utilize the Kneedle algorithm developed by Satopaa et al. [32], as explained in the next section. It enables us to achieve a balance between accuracy and resource usage.

Optimal k determination
In this third stage, our goal is to determine the minimal subset of $k_{\min}$ features that can classify different classes effectively without incurring a drop in performance. As presented in Section Clustering evaluation for feature selection using MSS, the MSS index exhibits a correlation with the accuracy results over all possible values of k. Thus, in this stage of the proposed GB-AFS method, illustrated in Fig. 1, we apply the Kneedle algorithm to the MSS graph to find the minimal subset of $k_{\min}$ features.
Applying the Kneedle algorithm to the MSS graph enables the identification of the knee point, as illustrated in Fig. 3 by the vertical dashed line. This knee point corresponds to a specific k value representing the minimal number of features needed for classification. Subsequently, the k medoids associated with this k value are retrieved as the minimal subset of features required for classification.

Time Complexity of the GB-AFS method:
Constructing the separation matrix has a computational complexity of $O(NM)$. Dimension reduction on this matrix, done with t-SNE in our experiments, has a computational complexity of $O(M^2)$ [28]. In the clustering evaluation phase, for each potential feature subset size $k$, we perform K-medoids clustering with a computational complexity of $O(M^2 k)$ and evaluate cluster quality via the MSS index with a computational complexity of $O(Mk)$, leading to a total time complexity dominated by $O(M^2 k)$ for a given $k$; this clustering is repeated for every candidate value of $k$. Then, the optimal $k$ value is found using the Kneedle algorithm with a computational complexity of $O(M)$, which results in an overall complexity of $O(M^3)$ for the GB-AFS method.

Datasets
Dataset quality and relevance are crucial in data analysis. We carefully chose five datasets originating from different domains to ensure broad coverage of diverse scenarios and challenges. To support a comprehensive analysis, these datasets were chosen specifically for their variability in several dimensions, including the number of features, samples, and classes, and the level of class imbalance. Table 1 presents, for each dataset, the number of instances, the number of features, and the number of classes. Below is a short description of each dataset.
Isolet is an imbalanced dataset of 617 voice recording features from 150 subjects reciting the English alphabet, with the goal of classifying the correct letter among the 26 classes.
Cardiotocography comprises 23 distinct assessments of fetal heart rate (FHR) and uterine contraction (UC) characteristics, as documented on cardiotocograms. They were categorized into 10 separate classes by experienced obstetricians.
Mice Protein Expression measures the expression levels of 77 proteins in the cerebral cortex of eight classes of mice undergoing context fear conditioning for evaluating associative learning.
Music Genre Classification is a dataset of 1000 labeled audio snippets, each lasting 30 s. Each sample is characterized by 197 distinct features, and the task is to discern among 10 music genres.
Microsoft Malware Prediction presents 257 file attributes spread across 9 distinct malware identification classes. Its primary purpose is to serve as a foundation for constructing machine-learning models aimed at malware prediction. The collection contains 1642 samples.

Parameter settings
Section Separability-based feature space outlines the first stage of our proposed method. This stage involves preprocessing the input data and transitioning it into a reduced feature space. This reduced space preserves the original features' capability to distinguish between each pair of classes. The GB-AFS method is designed as an agnostic solution that is not tied to any particular statistical measure or dimensionality reduction technique; it allows users to make these choices. In our experiments, to evaluate the difference between the two probability distributions, we employed the Jeffries-Matusita (JM) distance. Further details on this are provided in Section Jeffries-Matusita distance. Additionally, we selected t-SNE as the nonlinear dimensionality reduction technique. Further details on this are provided in Section t-distributed Stochastic neighbor embedding. Other class separability measures and nonlinear dimensionality reduction techniques may be used in the first stage of the GB-AFS method. In Section Method generalization, a sensitivity analysis is conducted to evaluate the generalization of the GB-AFS method when employing a variety of dimensionality reduction techniques and statistical measures.
For K-medoids clustering, we used the PAM method for cluster assignment with K-means++ initialization. This initialization selects initial medoids that are far from each other, mirroring the idea of choosing features with complementary separation capabilities, and it improves the algorithm's efficiency and accuracy.
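To make the first stage concrete, the following sketch builds one row of a feature separability matrix: for a single feature, it computes a JM distance for every pair of classes. It rests on assumptions that are not taken from the paper: each class is modeled as a univariate Gaussian with non-zero variance, and the JM distance is taken in the common form $JM = 2(1 - e^{-B})$, where B is the Bhattacharyya distance. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pvariance


def jm_distance(a, b):
    """JM distance between two samples under a univariate Gaussian assumption."""
    mu_a, mu_b = mean(a), mean(b)
    var_a, var_b = pvariance(a), pvariance(b)  # assumed to be non-zero
    # Bhattacharyya distance between two univariate Gaussians.
    bhatt = (0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
             + 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b))))
    return 2.0 * (1.0 - math.exp(-bhatt))  # assumed form, bounded in [0, 2]


def separability_row(values, labels):
    """JM distance of one feature for every pair of classes, in sorted pair order."""
    classes = sorted(set(labels))
    by_class = {c: [v for v, lab in zip(values, labels) if lab == c] for c in classes}
    return [jm_distance(by_class[c1], by_class[c2])
            for c1, c2 in combinations(classes, 2)]


# Toy feature: classes "a" and "b" overlap completely, class "c" is far away.
values = [0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
row = separability_row(values, labels)  # pairs: (a, b), (a, c), (b, c)
```

Stacking one such row per feature gives a matrix with M rows and one column per class pair; a feature that separates "a" from "c" but not "a" from "b", as here, would complement a feature with the opposite pattern.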
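The clustering step can be sketched as follows. This is a minimal illustration and not the configuration used in the experiments: it swaps in alternating (assign, then update) K-medoids in place of PAM's swap search, and a deterministic farthest-first seeding in place of K-means++ initialization. The points stand for features embedded in the low-dimensional space, and all names and coordinates are hypothetical.

```python
import math


def farthest_first_init(points, k):
    """Pick k initial medoid indices that are spread far apart (deterministic)."""
    medoids = [0]
    while len(medoids) < k:
        # Distance from every point to its nearest already-chosen medoid.
        nearest = [min(math.dist(p, points[m]) for m in medoids) for p in points]
        medoids.append(max(range(len(points)), key=nearest.__getitem__))
    return medoids


def assign(points, medoids):
    """Label each point with the position of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, max_iter=100):
    """Alternating K-medoids: returns (medoid indices, cluster labels)."""
    medoids = farthest_first_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


# Nine hypothetical embedded features forming three groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
medoids, labels = k_medoids(points, k=3)
```

Because each medoid is an actual data point, the medoid indices can be read directly as the indices of the selected features, which is why K-medoids rather than K-means suits this task.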

Baseline methods
The predictive efficacy of our GB-AFS method is evaluated against a total of six state-of-the-art methods.
The first method is MGFS, as introduced by Amin et al. [20]. This method requires user input to determine the size of the feature subset. Here, we employ the $k_{min}$ value derived from our GB-AFS method. To execute the method, we utilized the authors' official repository¹.
The second method is Inf-FS, proposed by Roffo et al. [23]. The official repository² for this method lacks the implementation of the minimal k-selection step; thus, we again utilize the $k_{min}$ value from our GB-AFS method in this case.
The third method is S³F, proposed by Thiago et al. [24]. Unlike the two previous methods, S³F includes an internal mechanism for selecting the optimal k value during its execution, which means the chosen k may differ from that of our method.
The fourth method is ReliefF [41], which ranks the importance of each feature by measuring how well it distinguishes between instances of different classes while considering the proximity of those instances to each other. This method also requires user input to determine the size of the feature subset; hence, we use the $k_{min}$ value obtained from our method.
The fifth method is Minimum Redundancy Maximum Relevance (mRMR) [42], which selects features that are both highly relevant to the target and minimally overlapping, optimizing for predictive power and information uniqueness. Determining the feature subset size in this method requires user intervention, so we again adopted the $k_{min}$ value obtained by our GB-AFS method.
The sixth method is Correlation-based feature selection (cFS) [43], which identifies and selects features that are highly correlated with the target variable but minimally correlated with each other, aiming to improve performance by reducing redundancy. This method also takes the feature subset size as user input. Thus, the $k_{min}$ value found by the proposed GB-AFS method was employed.

Experiment method
To prevent features with large numerical ranges from dominating those with small numerical ranges, the data were rescaled to lie between 0 and 1 using the min-max normalization procedure. Then, we split the data randomly such that 75% of the instances (training dataset) were used for applying GB-AFS to determine the set of $k_{min}$ features and build the classifiers. The remaining 25% (test dataset) was used to evaluate the performance of the GB-AFS and the resulting classifiers.
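The preprocessing described above can be sketched in a few lines. This is a minimal illustration in plain Python, following the order given in the text (rescale, then split); the function names, the seed, and the toy data are hypothetical, and mapping constant columns to 0 is an assumption made here to avoid division by zero.

```python
import random


def min_max_scale(rows):
    """Rescale every column of a row-major table to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; use 1.0 so its values map to 0.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Randomly split the indices 0..n-1 into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


# Two features on very different scales.
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 300.0], [5.0, 1000.0]]
scaled = min_max_scale(data)
train_idx, test_idx = train_test_split_indices(len(data))
```

Repeating the split with a different seed for each of the 10 runs would correspond to the repeated random train-test splits used for the significance analysis.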
To find $k_{min}$, we evaluated the MSS over a validation dataset for each set of k features found by applying the GB-AFS on the training set, where $k \in [2, M]$. To reduce the bias when selecting the training and validation data, we used a five-fold cross-validation approach [44], where 80% of the dataset was used to identify the set of k features and the remaining 20% was used for MSS calculation. For each value of k, we calculated the MSS five times, each time using a different subset as the validation dataset. We then averaged these values to generate an averaged MSS graph. Next, we applied the Kneedle algorithm³ to the averaged MSS graph to obtain the value of $k_{min}$, which represents the number of features in the final dataset.
After determining $k_{min}$, we applied GB-AFS to the entire training set to obtain the set of $k_{min}$ features and construct three different classifiers (KNN, Decision Tree, and Random Forest) based on the chosen features. It should be noted that these classifiers were chosen randomly to enable the evaluation of our method in relation to other methods. The trained classifiers were then employed to classify instances in the out-of-sample test set and to evaluate the accuracy and balanced F-score metrics. To evaluate the statistical significance of the results in comparison to the benchmarked methods, we repeated the entire experimental procedure 10 times, using a different random split into training and test sets for each iteration.
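The fold-averaging procedure can be outlined as below. The sketch only shows the cross-validation bookkeeping: `score_fn` is a hypothetical placeholder for "select k features on the training fold with GB-AFS and compute the MSS on the validation fold", and the dummy scorer at the end exists purely so the example runs.

```python
import random


def k_fold_indices(n, n_folds=5, seed=0):
    """Yield (train, validation) index lists; each fold is validation once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def averaged_curve(n, ks, score_fn, n_folds=5, seed=0):
    """Average score_fn(train, val, k) over the folds for every k."""
    totals = {k: 0.0 for k in ks}
    for train, val in k_fold_indices(n, n_folds, seed):
        for k in ks:
            totals[k] += score_fn(train, val, k)
    return {k: total / n_folds for k, total in totals.items()}


# Dummy scorer (not the MSS): depends only on the fold size and k.
def dummy_score(train, val, k):
    return len(val) / k


splits = list(k_fold_indices(100))
curve = averaged_curve(100, ks=[2, 4, 5], score_fn=dummy_score)
```

With 100 instances each split holds 80 training and 20 validation indices, matching the 80%/20% proportions described above; the averaged curve is what a knee detector would then be applied to.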

Experiment results
We evaluated the performance of our proposed GB-AFS method based on two key metrics: accuracy and balanced F-score. The performance of the classifiers was determined by calculating the average values over 10 test sets, as described in Section Experiment method. The accuracy and balanced F-score of the proposed GB-AFS were compared to ReliefF, mRMR, cFS, Inf-FS, and MGFS using the same k value found by GB-AFS, since these methods take the number of selected features as an input parameter. The results of S³F are obtained for the k value determined as part of that method. These results are reported in Table 2. For each dataset and classifier, the results of the best method are shown in bold. Since performance results were generated for all feature selection methods in each of the 10 runs, a paired t-test [45] is employed to compare the result of the best feature selection method with the second-best method in each combination of dataset and classifier. Statistically significant differences at a p-value < 0.05 are indicated by underline.

Across all combinations of datasets and classifiers, the GB-AFS shows an average accuracy improvement of 7.5% and an average balanced F-score improvement of 7.7% compared to the other methods. In 12 out of 15 combinations of datasets and classifiers, the GB-AFS method performs better in terms of accuracy than the other methods, with improvements ranging from 1.6% to 4.2% and an average improvement of 2.8% over the second-best method in each combination. In 11 of these 12 combinations, GB-AFS outperformed the other filter methods with a statistically significant difference (p-value < 0.05). In terms of the balanced F-score, the GB-AFS method achieved better results in 13 of the 15 combinations of datasets and classifiers, with improvements ranging from 0.2% to 4.2% and an average improvement of 2.8% over the second-best method in each combination. Of these 13 combinations, 11 were found to be statistically significant.
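The paired comparison between the best and second-best methods can be illustrated with a short sketch. The accuracy values below are invented for illustration and are not results from Table 2. The code computes the paired t statistic directly and compares it with 2.262, the two-sided critical value at the 0.05 level for 9 degrees of freedom (10 paired runs); the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on two equal-length result lists."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Invented accuracies of two methods over the same 10 train-test splits.
best = [0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82, 0.80, 0.83]
second = [0.78, 0.80, 0.79, 0.79, 0.80, 0.79, 0.80, 0.80, 0.78, 0.79]

T_CRIT_DF9 = 2.262  # two-sided critical value, alpha = 0.05, 9 degrees of freedom
t_stat = paired_t_statistic(best, second)
significant = abs(t_stat) > T_CRIT_DF9
```

The test is paired because both methods are evaluated on the same 10 splits, so the split-to-split variation cancels in the per-run differences.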
Figure 4 displays the average accuracy of the proposed GB-AFS method with a 95% confidence interval, in comparison to the S³F method, which also automatically finds the best k value according to its implementation. The results show that the GB-AFS method achieved significantly better accuracy than the S³F method, with an average improvement of 12.7%. Moreover, the GB-AFS method selected between 7% and 30% of the features in each dataset, while in 14 of 15 combinations of datasets and classifiers, there was no statistically significant difference (at a p-value < 0.05) in accuracy between the GB-AFS method and using all features. Although the accuracy results are similar, the percentage of time saved on average by using a set of $k_{min}$ features ranges from 15% for the smallest dataset to 70% for the largest (Table 3).

Table 3 The running times and the percentage of time saved when using a set of $k_{min}$ features obtained by GB-AFS in comparison to the running times when using all features

Fig. 4 Comparison of the accuracy of various classifiers utilizing k features selected by the GB-AFS with the MSS index, k features selected by the S³F method, and the complete feature set available in the dataset

Graphical analysis and interpretation
This section illustrates the ability of the GB-AFS method to select features with complementary discriminating capabilities, effectively covering the entire feature space.
Figure 5 shows the features of the Microsoft Malware dataset embedded in the low-dimensional space obtained by using the JM and t-SNE methods. Each feature i is assigned a color based on its separation capabilities, such that the higher the value, the darker the color. The $k_{min}$ features chosen by the ReliefF method and by the GB-AFS method, with $k_{min}$ found by our proposed method, are indicated by a rhombus and a circle, respectively. As can be seen from Fig. 5, the ReliefF method tends to select features from the upper right corner of the graph with high separation capabilities, whereas the GB-AFS method selects features that span the entire graph, indicating a wider range of separation capabilities. Behavior similar to that of ReliefF in Fig. 5 was also observed in the results of the other benchmarked methods across all datasets.

Figure 6 presents the ability of the GB-AFS to select a low number of features while preserving the accuracy obtained when using the entire feature set, based on the Cardiotocography dataset and the Random Forest classifier. The graph presents the contribution of each of the 7 selected features, out of the original 23 features, to the obtained accuracy. These selected features are marked with blue dots on the graph. It can be observed that using only the 7 features selected by the GB-AFS method results in an accuracy of 0.746, which is slightly lower than the 0.764 accuracy achieved using all 23 features. This modest decrease in accuracy, while using only about 30% of the available features, highlights GB-AFS's capability to preserve high levels of performance with a significantly reduced number of features.

Method generalization
The GB-AFS method, as illustrated in Fig. 1, does not assume a specific statistical measure for generating the feature separability matrix or a specific dimensionality reduction technique for obtaining the low-dimensional feature space. Table 4 presents the results of employing the GB-AFS method with various pairings of dimensionality reduction techniques and statistical measures. For this purpose, we utilized t-SNE [28], UMAP [46], and diffusion maps [19] as dimensionality reduction techniques and JM distance [26,27], Wasserstein distance [47], and Hellinger distance [48] as statistical measures.
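As an illustration of how alternative statistical measures could be plugged in, the sketch below implements two of the measures named above in their simplest settings. These are assumptions made for the example, not the experimental setup: the Hellinger distance is computed between two discrete distributions (for instance, normalized histograms of one feature in two classes), and the Wasserstein distance is the one-dimensional case for two samples of equal size. Function names are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two samples of equal size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    # For equal-size samples, the optimal transport matches sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
shift = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

Either function could replace the class-pair distance when building the separability matrix; note that the Hellinger distance is bounded while the Wasserstein distance scales with the feature's units, which matters less after min-max normalization.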
The performance of the classifiers over the datasets for the combination of UMAP and Wasserstein and the combination of diffusion maps and Hellinger in Table 4 was determined by calculating the average values over two runs, while the performance of the combination of t-SNE and JM is directly sourced from Table 2. Upon reviewing the results of Table 4, it can be observed that the GB-AFS yielded similar performance for every combination of dimensionality reduction technique and statistical measure, which is superior in most cases to the compared methods in Table 2. More specifically, each combination of dimensionality reduction technique and statistical measure obtained better balanced F-score and accuracy in 4 to 5 out of 6 combinations of datasets and classifiers. These results highlight the generality of the GB-AFS method with respect to the dimensionality reduction technique and statistical measure used. They emphasize the strength of the GB-AFS method in selecting features with complementary abilities, a strategy that proves effective across diverse datasets.

Fig. 6 The results of the GB-AFS run on the Cardiotocography dataset, which led to the selection of a feature subset of 7 features, out of 23 original features. The names of these selected features are provided for reference

Conclusion and future work
This paper presents a novel graph-based filter method for automatic feature selection (GB-AFS) for multi-class classification problems. An algorithm is developed to apply the proposed method to find the minimal subset of features required to retain the ability to distinguish between each pair of classes. The experimental results on five popular datasets and three classifiers show that the proposed algorithm outperformed other state-of-the-art filter methods with an average accuracy improvement of 7.5%. Moreover, in 14 of 15 cases, the GB-AFS method was able to identify 7% to 30% of the features that retained the same level of accuracy as when using all features, while reducing the classification time on average by 15% for the smallest dataset and up to 70% for the largest. These findings highlight the method's efficiency in handling high-dimensional datasets.
While our study has demonstrated promising results, it is essential to consider certain limitations and opportunities for further research. Firstly, our approach is specifically tailored for tabular datasets. Secondly, our methodology does not consider the class distribution within the dataset. Future research could explore adapting the proposed method to other problems involving diverse datasets, such as audio or EEG signals [49]. Additionally, it would be interesting to investigate methods for incorporating the class distribution within the dataset into the construction of the separation matrix. Another potential direction could involve utilizing the proposed method for constraint-based classification problems, a common research challenge that has received significant attention recently [50,51]. In many real-world situations, each feature incurs an economic cost for collection, with limited resources being available to tackle the problem at hand. Thus, adapting the proposed algorithm so that the total cost of the selected features meets the constraint is an interesting direction to explore.

Fig. 1 The GB-AFS architecture: (A) Generate a feature separability matrix and reduce the dimensionality. (B) Perform K-medoids clustering for every $k \in [2, M]$ and calculate MSS. (C) Find the optimal k and return the corresponding k features found during clustering

Fig. 2 Application of the Separability-Based Feature Space part of the GB-AFS method to 257 features from the Microsoft Malware Sample dataset. The features are embedded in a 2-dimensional space by t-SNE and are colored by their Jeffries-Matusita value

Fig. 3 Silhouette, SS, MSS and accuracy results obtained by three classifiers over different values of k on the Microsoft Malware Sample dataset. A vertical dotted line represents the minimum k value, the "knee" point found by the Kneedle algorithm

Fig. 5 Microsoft Malware Sample features are color-coded according to their separation capabilities score. Our algorithm found 32 out of 257 features based on their complementary separation abilities, while ReliefF marked features with similar values and not necessarily with complementary abilities

Table 1 Overview of the datasets

Table 2 Results over the five datasets. A comparison of the accuracy and balanced F-score results of our method vs. state-of-the-art methods. The results of the best method are in bold for each dataset and experimental setup. Paired t-test significance at p-value < 0.05 is indicated by underline. ¹Decision Tree, ²Random Forest, ³Accuracy, ⁴Balanced F-score

Table 4 A comparison between the accuracy and balanced F-score outcomes of our proposed method, incorporating various combinations of dimensionality reduction techniques and statistical measures. Best-performing methods for each dataset and setup are in bold. ¹Decision Tree, ²Random Forest, ³Balanced F-score