Cooperative co-evolution for feature selection in Big Data with random feature grouping

A massive amount of data is generated with the evolution of modern technologies. This high-throughput data generation results in Big Data, which consist of many features (attributes). However, irrelevant features may degrade the classification performance of machine learning (ML) algorithms. Feature selection (FS) is a technique used to select a subset of relevant features that represent the dataset. Evolutionary algorithms (EAs) are widely used search strategies in this domain. A variant of EAs, called cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for optimization problems. The existing solutions have poor performance because of some limitations, such as not considering feature interactions, dealing with only an even number of features, and decomposing the dataset statically. In this paper, a novel random feature grouping (RFG) has been introduced with its three variants to dynamically decompose Big Data datasets and to ensure the probability of grouping interacting features into the same subcomponent. RFG can be used in CC-based FS processes, hence called Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experiment analysis was performed using six widely used ML classifiers on seven different datasets from the UCI ML repository and Princeton University Genomics repository with and without FS. The experimental results indicate that in most cases [i.e., with naïve Bayes (NB), support vector machine (SVM), k-Nearest Neighbor (k-NN), J48, and random forest (RF)] the proposed CCFSRFG-1 outperforms an existing solution (a CC-based FS, called CCEAFS) and CCFSRFG-2, and also when using all features in terms of accuracy, sensitivity, and specificity.

problems across various domains consist of several features (attributes in dataset terminology). However, not all of these features are important, because some are irrelevant or redundant, and as such, may degrade the performance of ML classifiers [3,4]. Feature selection (FS) is an approach to select the relevant features for reducing the dimension of data in order to improve ML performance [5]. Formally speaking, feature selection is a process to select a subset of s features from a full set of n features ( s < n ) in a dataset by removing irrelevant and unimportant features, thereby representing it with less features [2]. A search technique initiates the FS process to discover feature subsets. Then, feature subsets are evaluated by different performance measures, for example, classification accuracy. A terminating criterion, for instance, the maximum number of generations, is used to terminate the FS process. A validation method at the end of the FS process can test the validity of the selected subset of features.
A dataset consisting of n features has 2 k possible solutions, which makes the feature selection process computationally expensive. A wide range of search techniques can be applied to the FS process, for example, greedy search, best search, or evolutionary search [6]. Evolutionary algorithms (EA) are search techniques that are widely used for feature selection [7]. However, with the increase in data samples and features in a dataset, the search space also increases. In most cases, this impacts the effectiveness of EAs. Cooperative co-evolution (CC), a meta-heuristic algorithm, follows a divide-and-conquer technique and has been effectively applied in different domains, including a limited number of FS applications. CC decomposes each complex problem into multiple subproblems, optimizes each subproblem individually, and collaborates different subproblems to build a complete solution to the problem [5]. Hence, a CC-based FS approach with a suitable decomposition method, an appropriate optimizer, and a proper collaboration technique can be studied for feature selection problems in different Big Data domains. However, the performance of a CC algorithm (i.e., solution quality) depends on how the problem is decomposed [8][9][10][11][12]. The performance of a CC algorithm may degrade significantly if the problem is non-separable, such as if the features interact with each other [9,13,14]. In the literature, it is suggested that the interdependency between the decomposed subproblems should be minimized because of the collaboration requirement of CC algorithms [15]. However, for real-world problems without any prior information about how the features in a dataset interact, it is difficult to find a suitable problem decomposition technique for feature selection. Non-CC-based feature grouping methods are also investigated in the literature [16][17][18]. This paper aims to investigate CC-based feature grouping.
To overcome the aforementioned limitations, this paper introduces a CC-based FS framework with a proposed random feature grouping (RFG) decomposition technique: Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). This approach has been evaluated with two variants using six widely used ML classifiers on seven different Big Data datasets from the UCI ML repository 1 and Princeton University Genomics Repository. 2 Based on the analysis of the comparative results, it has been observed that in terms of accuracy, sensitivity, and specificity, CCFSRFG with RFG-1 in most cases outperforms CCEAFS [19] and the classifiers when using all features, along with a significant feature reduction.
The rest of the paper is organized as follows. Next section presents feature selection approaches using a taxonomy and an extensive literature review of problem decomposition methods. After that, a novel cooperative co-evolution technique and a novel feature selection framework is included. Then, a proposed random feature grouping with three variants is illustrated. The experimental results are presented and analyzed in "Results and discussions" section. The conclusion and future work directions are included in the last section.

Literature review
Many feature selection approaches studied in the literature are based on various metrics, such as information theory, probability distribution, and classification accuracy [20]. Table 1 presents a taxonomy of these.
There are a few FS research works that have been studied in the literature based on CC [21][22][23][24][25][26][27][28][29]. The literature indicates that the FS based on CC is an emerging research field and it is yet to be investigated for optimization problems. Based on this finding, a CC-based FS framework (CCEAFS) has been proposed in our previous study [19]. In CCEAFS, the FS problem has been decomposed into subproblems based on static decomposition; each of these subproblems has been optimized using a genetic algorithm, and a complete solution to the problem has been created using N + 1 collaboration with random collaboration for the first generation, and using best individuals as collaborators for further generations. CCEAFS has a promising performance and significant feature reduction capabilities while maintaining classification performance, and in a few cases, it performs better than using all features. Nevertheless, CCEAFS has some limitations: (1) it does not consider feature interactions for features with tightly coupled relationships and which potentially increase classification accuracy; (2) it considers datasets with an even number of features and decomposed Table 1 Taxonomy of FS approaches [5,74,75] GP genetic programming, PSO particle swarm optimization, ACO ant colony optimization, TLBO teaching learning-based algorithm, GSA gravitational search algorithm, CMIM conditional mutual information maximization, BGA binary genetic algorithm, mRMR minimum redundancy maximum relevance, TLBOL TLBO with opposition-based learning, DE differential evolution, MA memetic algorithm, LCS learning classifier system, ES evolutionary strategy, ABC artificial bee colony datasets statically with an even number of features in each subdataset; however, in practice, datasets may have an odd number of features; (3) it considers datasets with a high number of samples and a low number of features and has not been validated using datasets with other characteristics. The performance of a CC mostly depends on the appropriate decomposition methods, suitable optimizers, and proper collaboration techniques [30]. This paper aims to overcome the limitations of the previously studied CCEAFS framework for feature selection in Big Data, and also to propose a new decomposition method for feature selection.

Preliminaries
This section defines the variable interaction and problem structure, whether it is separable, partially-separable, partially additively separable, or non-separable.
Definition 1 Decision variables i and j from a decision vector x are interacting if the ith and jth variables can be replaced by x ′ i and x ′ j such that [31]: where x = {x 1 , x 2 , . . . , x n } is a decision vector of n decision variables.
Definition 2 Function f (x) of n decision variables is called separable if it is represented as a sum of n independent functions having only one independent variable in each [32,33]: The Rastrigin function is an example of separable functions [34]: where − 5.12 x i 5.12 and the global minimum of (3) is at the point x = {0, 0, 0, . . . , 0} .
One of the characteristics of this function is that it has many suboptimal peaks whose values are increased with the increase of the distance from the global optimal point. (1) , argmin ( Griewank's function is a classical example of partially separable functions, which is defined as [36]: where the f (x) has only one global minimum at the point (0, 0, . . . , 0).
Definition 4 Function f (x) of n decision variables is called partially additively separable having m independent subcomponents defined as [33,35]: where x i represents mutually exclusive decision vectors of f i . Definition 5 Function f (x) is a n-non-separable function if n of its decision variables x i are not independent [32].
Function f (x) is fully-non-separable if any two of its decision variables x i are not independent, i.e., every pair of the decision variables interact with each other [33,35].
The extended Rosenbrock function is an example of a non-separable function, which is defined as [32]: where the function has exactly 1 minimum at point (1, 1, 1) for n = 3, and 2 minima for 4 n 7 when the function gradient equals to zero.

Review on problem decomposition approaches using CC
The critical step of a cooperative co-evolution (CC) technique is how to decompose the problem into subproblems [37]. Generally speaking, a CC algorithm should decompose a large problem into many subproblems with minimal interdependency between them. The first problem decomposition methods using CC used two simple approaches: onedimensional-based and splitting-in-half strategies [38]. The former type decomposes an n-dimensional problem into n one-dimensional subproblems. This strategy is quite simple and can be effective for separable problems. However, this does not consider the interdependencies between subproblems. Therefore, it is not suitable for handling nonseparable problems satisfactorily. In comparison, the splitting-in-half strategy always decomposes an n-dimensional problem into two equal n 2 -dimensional subproblems. However, when n is large enough, splitting the problem into n 2 -dimensional subproblems will still be large resulting computational difficulties [39].
Considering the interdependencies between subproblems, a CC framework, EACC-G, was proposed for high-dimensional optimization problems based on random grouping (RG) [39]. This framework randomly divides the decision variables of the high-dimensional problem into a number of groups with predefined group size. It provides a nonzero probability of assigning interacting variables into the same group. In RG, every decision variable has an equal probability of being assigned to any of the subcomponents, and at the beginning of each cycle, the entire process is repeated. This results in a better performance than one-dimensional-based and splitting-in-half strategy decomposition methods for the CC framework. However, the problem here is to define the optimal group size, which is problem-dependent. A small group size may be suitable for separable problems. A larger group size can be useful for non-separable problems, because this can provide a higher probability grouping for more interacting decision variables into the same group. Furthermore, the selection of group size can be difficult at different stages of optimization for a single problem. A small group size can be helpful for finding solution regions faster at the initial stages; however, for later stages, a large group size is preferable, containing more global information for the subproblems. Hence, the appropriate group size depends on the objective function, optimization, and decision variables. To obtain the appropriate group size, a multilevel CC framework (MLCC) was proposed by the same research group in 2010 [40]. In MLCC, a decomposer pool is built based on different group size and using random grouping strategy, where each decomposer in the pool has various interaction levels between decision variables. The evolution process is divided into several cycles, and at the start of each cycle, a decomposer is selected from the pool depending on their record of performance. The selected decomposer is used to decompose the problem into subproblems having decision variables. The record of performance of each decomposer is updated at the end of each cycle. Following this strategy, MLCC is able to self-adapt to an appropriate interaction level regardless of the decision variables, objective function, and optimization stages. To further improve performance, MLCC also includes a weight vector to co-adapt subcomponents. To group interacting variables together, a variable interaction learning (VIL) method was proposed that accounts all decision variables as independent initially and place them into a separate group. It further investigates the variable interactions iteratively, and when interactions are identified by a pairwise interaction between variables using a non-monotonicity detection, it merges variables into the same group [31]. This method identifies the subcomponents structure and adjusts the component size using two phases: learning and optimization.
Although DECC-G and MLCC provide opportunities to decompose a large problem into subproblems efficiently, they suffer from two major weaknesses, which degrade algorithmic performance. The first one is the decrease of probability to group interacting variables in a subproblem when the number of interacting variables is increased. The second one is that the adaptive weighting in MLCC is not important in the entire process and sometimes fails to improve the quality of the solution and also increases the number of fitness evaluations. To improve the performance of MLCC, a more frequent random grouping (DECC-ML) method has been proposed by Omidvar et al. [14]. The DECC-ML method reduces the number of fitness evaluations to find a solution without deteriorating solution quality. In this method, to choose a decomposer, a uniform random number generator has been used, unlike MLCC, which uses a sophisticated probability formula. After DECC-ML, Omidvar et al. [41] also proposed another decomposition method called Delta Grouping. When two interacting variables are grouped into different subcomponents, each variable can have a limit to be improved towards the optimal value in a non-separable function. They calculated the amount of change called delta value in each decision variable in every cycle, which can identify interacting variables. Delta values measure the averaged difference for a specific variable over the entire population to identify interacting variables. The assumption here is to have smaller delta values for interacting variables. The decision variables are sorted based on delta values, and variables with smaller delta values are grouped into the same subcomponent. Compared to previous techniques, delta grouping is effective to group interacting variables. However, this grouping method performs poorly if the objective function has more than one nonseparable subcomponent.
A metamodel-based decomposition method called high-dimensional model representation (HDMR) has also been proposed, which considered two variables non-separable when they have a cooperative effect on the approximated second-order HDMR model [42]. HDMR uses a map of linkage between input and output system variables. For largescale fully separable continuous functions, Omidvar et al. introduced a decomposition approach, multilevel softmax (MLSoft), in 2014 [43]. MLSoft is basically a modified MLCC algorithm proposed based on reinforcement learning for adaptively finding the proper subcomponent size. The experimental analysis concludes that a subcomponent size should be as small as possible. Because it is not possible to predict the capacity of an optimizer in advance, optimal subcomponent size still requires empirical analysis.
Random grouping and delta grouping decomposition methods decompose an n-dimensional problem into k s-dimensional subproblems. However, the major drawback of these approaches is to determine the value of k or s. When the objective function has large groups of interacting variables, a small s value can decrease algorithmic performance. In contrast, when the objective function has several small groups of interacting variables, a large s value cannot ensure proper utilization of decomposition techniques. Although MLCC [40] solves the problem of specifying the value of s through the use of a decomposer pool with a set of possible s values, it still needs to determine the set of potential s values. In addition, selecting an s value in MLCC divides the decision variables into a set of equally-sized subcomponents, which is unlikely to have equallysized interacting groups for most real-world problems. To overcome these limitations, an automatic decomposition method, differential grouping (DG) was introduced by Omidvar et al. in 2013 [44]. DG starts investigating the interaction of the first decision variable with the rest of the variables in a pairwise fashion based on the definition of partially additively separable functions. When DG identifies an interaction between the first variable, and any variable, it removes that variable from the set of decision variables and places it in a subcomponent. This process is iterated until all other interactions between the rest of the variables and the first variable is identified and the first subcomponent is built. When no interaction is found, the variable is assumed separable. The entire process is iterated for the rest of the variables to find interactions for all decision variables. Even though DG achieves promising performance in decomposing a large problem into multiple subproblems by grouping interacting variables together, it has some drawbacks, namely, not detecting overlapping functions, being slow to check all interactions, requiring a threshold parameter, and being sensitive to the choice of the threshold parameter.
DG can identify the decision variables, which interact directly, and to identify the indirect interactions between the decision variables, an extension of DG, extended DG (XDG) was proposed by Yuan et al. in 2015 [45]. In XDG, first the variables with direct interactions are grouped to the subcomponents and then the overlappings between subcomponents are detected. When an overlap is detected, the subcomponents with same decision variables are combined. XDG may perform poorer in terms of identifying the interaction among multiple variables. Besides, XDG cannot model the decomposition problem formally, and it inherits the sensitivity issue of DG. To overcome these drawbacks, in 2016, Yingbiao et al. [15] proposed another improvement of DG, called graphbased DG (gDG). This gDG starts by identifying all separable variables and allocating them into the same group. Here, each decision variable is considered as a vertex, and an edge is an interaction between two variables. A graph is then constructed from this arrangement to model the decomposition problem, and the connected components are identified.
In 2016, Mei et al. [46] proposed a global differential grouping (GDG) approach to improve the accuracy of the DG algorithm by adopting the DG mechanism and by maintaining the global information in terms of the interaction matrix between decision variables. In addition, GDG also considers computational error. The authors identified three drawbacks of DG: (1) the comparison between the variables is incomplete; (2) it tends to miss several interactions between decision variables, being the grouping performance sensitive to a threshold parameter; and (3) it cannot work appropriately with fully separable functions or when the functions have a large portion of separable variables. To overcome these limitations, GDG uses three different measures: interaction, independence, and interaction and independence combined. GDG can detect variable dependency and can isolate them into the same groups more accurately. GDG overcomes the sensitivity issue of DG by considering computational errors. However, it is not suitable for imbalanced functions because of the global parameter to identify all interactions.
DG uncovered variable interdependency for making near-optimal decomposition with the cost of ignoring indirect non-separable interactions. XDG and GDG solved the indirect non-separability problem; however, they both need high computational cost for the interaction identification process. This leads to a suboptimal performance in the following optimization process. The identification process required substantial computational time because of using a pairwise fashion to detect the interactions by most of the algorithms. To address this, a fast interdependency identification (FII) algorithm was proposed by Xiao-Min et al. in 2016 [11]. FII performs the decomposition by identifying the interdependency information for separable and non-separable variables. When non-separable variables are discovered, their interdependency is further investigated. FII requires no complete interdependency for non-separable variables, because it avoids interdependency identification in a pairwise fashion. As a result, FII can save a significant number of fitness evaluations for the following optimization process. The nearoptimal decomposition can thus be obtained at a small computational cost.
DG is a breakthrough for large-scale problem decomposition. However, it has a number of drawbacks, and to address these, several improvements have been proposed based on DG. Similarly, another improvement of DG, called DG2, has been proposed by the same research group who introduced DG [47]. For fully separable functions, DG2 reduces the number of fitness evaluations by half. This, in turn, supports the algorithm to check every pair of interacting variables at a significantly lower computational cost compared to the previous approaches. The overlapping functions can be effectively identified by testing all pairs of variables interaction. If the DG mechanism is used to develop DG2, a systematic generation of sample points maximize the point reuse, which provides lower bound to detect the interactions. Beyond efficiency, grouping accuracy of DG is also improved significantly with DG2. Since DG2 considers the computational rounding-errors to estimate an appropriate threshold level, it can determine the sensitivity to weak variable interactions. The automatic calculation of the threshold value is useful to deal with imbalanced functions. Furthermore, unlike DG, DG2 does not require an external parameter.
Because previous decomposition algorithms required a large number of fitness evaluations (otherwise failed to obtain proper decomposition results), a historical interdependency based differential grouping (HIDG) was proposed in 2018 [12]. Similar to FII, HIDG also uses decision vectors to investigate the interdependencies among vectors in a systematic way. Based on the historical interdependency information, HIDG proposes a criterion to infer interactions for decision vectors without using extra fitness evaluations. To reduce the computational expense of the decomposition algorithm, a recursive differential grouping (RDG) method was proposed by Yuan et al. in 2017 [10]. RDG considers the interaction between decision variables depending on non-linearity identification. It recursively analyzes the interaction between variable x i and the rest of the variables. When an interaction is detected, the rest of the variables are divided into two groups of equal size. Then the interaction between variable x i and each group is examined. The process is repeated until all independent variables that interact with x i are detected and placed into the corresponding subcomponent as x i . The analytical methods validate the effectiveness of RDG without the need to examine the variable interactions in a pairwise fashion.
Another improvement of DG was proposed to overcome its drawbacks regarding the threshold parameter and identify indirect interaction, called ε-based DG ( ε-DG) [8]. In general, ε is a predefined threshold value, and ε is zero [44]. According to DG, when delta > ε , the decision variables are considered non-separable; otherwise, they are separable. However, authors of ε-DG experimented and observed that ε = 0 is ineffective because ε > 0 also occurs with separable variables, i.e., setting ε = 0 can classify a few decision variables to be non-separable, which are supposed to be separable. ε-DG detects that delta > 0 for two separable variables are resulted from a computational error. Hence, ε-DG first checks whether delta > 0 is resulted from an interaction or computational error. If it results from a computational error, delta is set to zero, which can solve the configuration issue of ε . In ε-DG, an n × n DG matrix (DGM) is constructed to represent the interactions between every pair of decision variables. The DGM is able to identify both direct and indirect interactions by setting an element of DGM to zero when there is no interaction between two associated variables; otherwise, it sets to a non-zero value.
Based on the concept of partial derivatives of multivariate functions, GDG has been proven to be an effective grouping strategy in automatically resolving the problem by the global information between variables. However, the grouping results of GDG are no longer updated once GDG decomposed a problem into subproblems and it is not automatically adjusted throughout the algorithmic evolution process. Hence, to resolve the problem by updating the grouping results throughout the evolutionary process, an improvement of GDG was proposed by Wu et al. in 2018 [48], called Dynamic GDG (D-GDG). Based on GDG, in D-GDG, the original data matrix of interacting variables is obtained from the general idea of the partial derivative of multivariate functions. The fuzzy clustering technique is used to discern the dynamic clustering of decision variables during the standardization of the original data matrix and to build the fuzzy relation matrix. The size of each group of interacting variables is limited by the lower and upper bounds of the variable grouping scale. The grouping of variables has been adaptively adjusted according to the algorithmic state throughout the optimization process.
In terms of time complexity, RDG is a very effective decomposition method. However, to identify whether two subsets of variables interact, a proper parameter setting is required to determine a threshold value. Furthermore, one global threshold value may not be sufficient for determining interactions of variables in subcomponents having a dissimilar contribution to the fitness value. To solve these problems, a decomposition method inspired by DG2 and RDG was proposed, called RDG2 [49]. In RDG2, the threshold value is adaptively estimated based on the computational round-off errors of RDG. An upper bound of the round-off errors can be adequate to distinguish between separable and non-separable variables of large-scale benchmark problems. Besides, experimental results have shown a higher decomposition accuracy of RDG2 compared to RDG and DG2.
All the preceding decomposition algorithms usually group linked or interacting variables into the same group. However, this does not tend to reduce the size of the original problem. Moreover, overlapping problems pose a further challenge to the decomposition algorithms because of the dependency of the components on each other. To address these issues, the RDG decomposition algorithm has been further modified for decomposing overlapping problems. This new algorithm, RDG3, breaks the linkage at variables, which is shared by more than one component. The performance of RDG3 has been evaluated by overlapping benchmark problems, which confirmed its superiority over RDG and other decomposition algorithms [50]. Table 2 presents a summary of the papers reviewed on problem decomposition techniques with key features.
From the literature, it has been observed that problem decomposition techniques fall into two categories [11,51]: Static decomposition A problem is decomposed into subproblems before the evolutionary process of CC starts, and the decomposed subproblems are fixed for the rest of the generations [11]. In the original CC implementation (CCGA), a problem of n-dimension was divided into n one-dimensional subproblems [34]. With this one-dimensional-based decomposition, CCGA performed better on separable problems than on non-separable problems (in CCGA, Rosenbrock was used as a non-separable problem). This concluded two major problems of static one-dimensional-based decomposition. First, the search capacity of CC may degrade drastically if there are interdependencies exist between the subcomponents. Second, one-dimensional decomposition may waste computational resources during the evolutionary process. The splitting-in-half static decomposition method cannot handle non-separable problems properly either [38]. By exploring the interdependency information in the problem, a number of detection-based static decomposition techniques proposed to make the near-optimal decomposition. Examples of such static decomposition methods include CCVIL [31], DG [44], XDG [45], GDG [46], RDG [10], D-GDG [48], RDG2 [49], and RDG3 [50]. These methods are also called automatic decomposition strategies as they undertake the variable interactions to group the variables [10].
Dynamic decomposition A problem is decomposed into multiple subproblems at the start. However, it has the ability to adapt the subcomponents during the evolutionary process, i.e., the decision variables can be regrouped into subcomponents in each cycle of the evolutionary process [11]. RG is a typical example of a dynamic decomposition technique [39]. In RG, at the start of each evolutionary process, two interacting decision variables have a non-zero probability of being grouped in the same subcomponent. However, as the number of interacting variables increases, RG becomes incapable of grouping all the interacting variables in the same group. Other approaches to improve the performance of dynamic decomposition techniques are also available including MLCC [40], DECC-ML [14], and delta grouping (DECC-D) [28].
The previous extensive literature review on problem decomposition approaches using CC found that the problem decomposition techniques have been widely used, improved, and evaluated on a range of benchmark functions. Apart from these strategies, a number of decomposition methods have also been studied in the literature with CC for some real-world problems. For example, a random-based dynamic grouping method was proposed to group variables for multi-objective optimization problems, which uses a decomposer pool similar to MLCC [40]. In this method, the decomposer pool consists of several groups of varying sizes, and the size is determined by C-metric based on historical performance [52]. A weighted random (WR) grouping strategy proposed for large-scale crossing waypoints location problem (a fully non-separable and non-differentiable problem) in air route networks where the space of the problem was updated by extreme weather conditions, breakdowns, and delayed aeroplanes [53]. A Markov

Key features Limitations
One-dimensional-based [34] An n-dimensional problem into n one-dimensional subproblems Quite simple and effective for separable problems Does not consider the interdependencies between subproblems; unlikely to handle the non-separable problems satisfactorily Splitting-in-half strategy [38] An n-dimensional problem into two equal n 2 -dimensional subproblems Simple and effective for separable problems like onedimensional-based approach When n is large enough, splitting the problem into n 2 -dimensional subproblems will still be large; computationally expensive; handles even number of dimensions only DECC-G [39] Randomly divides the decision variables in the highdimensional problem into a number of groups with predefined group size Provides a non-zero probability of assigning interacting variables into the same group Defining optimal group size, which is problem-dependent; selection of group size; a decrease of probability to group interacting variables in one subproblem when the number of interacting variables is increased MLCC [40] A decomposer pool of variable group size and random grouping  Does not consider weak linkage breaking random field-based decomposition method proposed to solve the stochastic flexible job shop scheduling problem for minimizing the expectation and variance of makespan [54].
Because the number of existing work on FS using CC is limited, problem decomposition techniques using CC for FS are also limited in the literature. A quantum-inspired self-adaptive CC incorporated into shuffled frog leaping algorithm was proposed by Ding and Wang in 2013 to reduce the number of attributes [55]. In this framework, the attribute sets were divided into subsets using a self-adaptive technique that selects a decomposer from the decomposer pool based on their historical performance records. The performance of this approach was evaluated on global optimization functions and UCI datasets. Recently, in 2019, Wang et al. [29] proposed a two-stage decomposition approach (CCFS/TD) based on CC for FS on high-dimensional classification problem. In the first stage, CCFS/TD decomposes the evolutionary cycle into m levels and performs optimization. In the second stage, the evolutionary cycle of each level is further decomposed into a number of independent processes using a majority voting technique. The effectiveness of the proposed CCFS/TD method has been validated by experiments on 10 benchmark datasets. A study by Sun et al. in 2015 [9] on the selection of decomposition methods for large-scale fully non-separable problems performed experiments on decomposition techniques using RG, delta grouping, DG, and XDG. This study observed that RG is the most effective decomposition method for fully non-separable problems. Because FS problems are fully non-separable, based on the motivation from the study of [9], in the next section of this paper, a random-based decomposition method, called random feature grouping (RFG) is proposed with three variants.

A novel feature selection approach
In this paper, a CC-based FS framework is introduced with a proposed random feature grouping (RFG) decomposition technique, Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). The motivation to propose RFG comes from the success of the random grouping decomposition strategy for non-separable function optimization problems and because the feature selection problem is a nonseparable optimization problem [9,13,14,39,56]. CCFSRFG has been applied to the FS problem in Big Data using six widely used ML classifiers, naïve Bayes (NB) [57], support vector machine (SVM) [58], k-Nearest Neighbor (k-NN) [59], J48 [60], random forest (RF) [61], and logistic regression (LR) [62], on seven different datasets from the UCI ML repository. 3 At first, the ML classifiers have been applied to all datasets without reducing the number of features. After that, CCFSRFG with two variants, is applied to all classifiers for reducing the number of features of the datasets. A penalty-based wrapper objective function defined in a previous study has been used as the fitness function in the CCFSRFG framework to deal with the twofold objectives of FS (reducing the dimension of features, and improving classification performance) in process in Big Data. The comparative results have been analyzed based on different performance measures, including accuracy, sensitivity, and specificity [63].

Cooperative co-evolution
In 1994, Potter and De Jong [34] introduced the cooperative co-evolution (CC) technique to solve optimization problems, which are large-scale and complex. This technique uses a divide-and-conquer approach to divide a large problem into multiple subproblems, and iteratively evolves the interacting co-adapted subproblems to build a complete solution.
The CC technique consists of decomposing an n-dimensional problem of search information S = {1, 2, . . . , n} into m subproblems {S 1 , S 2 , . . . , S m } [34]. Each subproblem of the n dimensions represents a new search space SP (i) for a particular problem, where the rest of the dimensions n j , with j = S i , are kept fixed. Hence, the entire search space is decomposed into m subproblems with lower dimensions and that can be handled using any populationbased evolutionary computational approach. These subproblems can be optimized individually, and communication between them is required only to evaluate the objective (fitness) function f. This implies that a candidate solution in search space SP (i) contains a few elements (comprising an individual I) of the n-dimensional problem ( I ∈ SP ). Therefore, in CC, a common n-dimensional context vector v is required to build using a collaborative individual (e.g., the current best individual) from each subproblem. A candidate solution to the problem is then formed by a cooperative algorithm to evaluate an individual in a subproblem, a candidate solution is built by joining representative collaborators from the context vector. Potter and De Jong decomposed an n-dimensional problem into n 1-dimensional subproblems in their original CC idea. In general, the n-dimensional problem can be decomposed into m subproblems with same dimension, i.e., n m = n/m [64]. Following this, a CC technique can be formally defined as follows [65]: An n-dimensional problem is decomposed as: and the context vector is built as: where v (i) is the n m -dimensional problem, which represents the collaborative individual from the i-th subproblem (e.g., the current best individual in SP (i) ): Given the j-th individual I (i,j) ∈ SP (i) of the i-th subproblem: The fitness function of this problem is given by f v (i,j) , where v (i,j) is defined as: , (I In general, the fitness of v (i,j) is evaluated on context vector v by replacing the elements of the individual from the i-th subproblem having the representative elements of individual I (i,j) . CC is therefore comprised of three main phases: (1) problem decomposition, (2) subproblem evolution, and (3) collaboration and evaluation.

Problem decomposition
The first phase of CC is how to decompose a large problem into subproblems generally depends on the problem structure [30]. If the problem is decomposed statically, it will have one element in each subproblem, whereas if the problem is decomposed dynamically, the grouping of elements into subproblems will be different than with the former. One objective of this paper is to review existing CC-based problem decomposition techniques and propose a suitable decomposition method for feature selection. Hence, CC problem decomposition is described in detail in "Literature review" section.

Subproblem evolution
Once the problem is decomposed into subproblems, each subproblem is allocated to a subpopulation. Each subpopulation is optimized individually either through a homogeneous or a heterogeneous evolutionary optimizer [30]. Optimizations of subpopulations can be carried out either sequentially or in parallel. In the sequential case, only one subpopulation evolves in each generation [66]. In contrast, in the parallel case, all subpopulations evolve in each generation simultaneously [67]. The most widely used evolutionary optimizer in this area is a genetic algorithm (GA), whereas differential evolution (DE) [68] is the most effective optimizer for CC.

Collaboration and evaluation
After subproblem optimization, subpopulations cooperate to build a complete solution to the problem. The fitness (objective function) of an individual is evaluated by selecting a collaborator individual from each subpopulation. The performance of this collaboration is specified as the fitness value to a individual being evaluated. Individuals having the best collaborating performance are joined together to discover the final solution to the problem at the end of a CC process [30]. 1 + 1 collaboration [37], the 1 + N collaboration model [69], and reference sharing (RS) [30] are examples of collaboration models used in CC.

Methodology
In this paper, a CC-based feature selection framework in Big Data with a random feature grouping (RFG) is introduced, called CCFSRFG. Considering correlations to be linearly associated with a dataset's records, for example, in a network traffic dataset, a correlation measure, such as a linear correlation coefficient, can be applied to measure the linear dependency between two random features. However, there are many real-world problems where the correlation between the features can be nonlinear as well. Therefore, a linear correlation study cannot measure the nonlinear dependency between two features. As a consequence, irrespective of whether the dependency between two features is linear or nonlinear, selecting a subset of features from the dataset that maximize classification accuracy is more appropriate [70]. Accordingly, the feature selection framework, CCFSRFG, with FRG as a decomposer, is proposed to select a suitable subset of features without considering its correlation. The RFG will increase the chance of optimizing interacting features together. The proposed methodology of CCFSRFG is displayed in Fig. 1.
The main idea behind this new RFG-based cooperative co-evolution feature selection (CCFS) framework is to dynamically decompose the feature vector into ms-dimensional subdatasets (n = m · s) . The fundamental difference between this new approach and the previously studied CCEAFS [19] approach is that CCFSRFG decomposes the feature vector dynamically to increase the probability of grouping interacting features together for subsequent optimization phase, which can lead to improving solution quality; while CCEAFS decomposes the feature vector statically starting from the feature indexed at 1, which cannot ensure the grouping of interacting features and may result in a low-quality solution. Formally, the proposed CCFSRFG can be described as follows: A dataset D consisting of n features, i.e., D = f 1 , f 2 , f 3 , . . . , f n . D is decomposed randomly into m subdatasets with s(s < n) features in each subdataset: Each subdataset is represented using a subpopulation in CCFSRFG. Here, s is the number of features in each individual (i.e., s features of a subdataset). Consider the size of each subpopulation (sp) is sz. An example of subpopulation sp 1 consisting s individual can be the following: 1 in an individual indicates that the feature in the corresponding is selected for the feature subset selection; 0 indicates that the feature is not selected for the feature subset selection. An individual in any subpopulation is evaluated by joining collaborators (i.e.,   Fig. 1 Proposed CCRFG-based feature selection individuals) from other subpopulations. For example, to evaluate individual ind 1 in subpopulation sp 1 , s.t. collaborator ind 3 from subpopulation sp 2 and collaborator ind 2 from subpopulation sp 3 , these three individuals are joined together to build a complete solution for the dataset with a reduced number features. Consider a random decomposition of 9 features into three subpopulations (s = 4) , is assumed with sp 1 ind 1 = f 3 , f 9 , f 7 , f 2 , sp 2 ind 2 = f 6 , f 1 , f 5 , f 8 , and sp 3 ind 4 = f 4 . If features f 7 , f 2 from sp 1 ind 1 , f 1 , f 5 from sp 2 ind 2 , and f 4 from sp 3 ind 4 are selected because of a binary (0 or 1) representation of features, the complete solution with sorted feature indices can be defined as follows: The solution with this reduced number of features is then evaluated by the ML classifiers to measure accuracy, sensitivity, and specificity performance. The best individual with a reduced number of features and the highest classification accuracy is achieved by a penalty-based wrapper objective function introduced in the previously studied CCEAFS approach [19]. The objective function for this is defined as: where f 1 = T c /T and f 2 = S/N. f is the overall objective function; w 1 and w 2 are two control parameters for the objective functions f 1 and f 2 , which are used to adjust the penalty term for f 1 and f 2 , with w 1 + w 2 = 1; T is the total number of test or train samples in the dataset (the test or train samples depend on the classification mode of using cross-validation or the supplied test set); T c is the number of correctly classified instances in the test or train samples; S is the number of features selected in the subset; and N is the total number of features in the dataset.
For CCFSRFG-1, the best individuals from other subpopulations are used as collaborators from generation 1 onwards, whereas in the case of CCFSRFG-2, for all generations, random individuals from other subpopulations are used as collaborators. The process continues until it reaches a fixed number of generations, or until no better fitness is achieved over the generations.
Datasets from the UCI ML repository have been gathered, and Microsoft Excel and WEKA 4 used to preprocess the datasets. The preprocessing stage mainly deals with the dataset conversion from CSV to the Attribute-Relation File Format (ARFF) to make it compatible with the experimental environment. The datasets were evaluated using six ML classifiers: NB, SVM, k-NN, J48, RF, and LR. The performance of these classifiers was measured in terms of accuracy, sensitivity, and specificity. CCFSRFG was employed to the datasets to decrease the number of features. The datasets with reduced dimensions were then evaluated using the same six ML classifiers, and the performance of each classifier was measured using the aforementioned metrics. The performance results obtained by the classifiers with and without FS were investigated, and the effect of FS on the performance of the classifiers was analyzed.
Algorithms 1-2 are the pseudocodes of the proposed CCFSRFG framework with two implementations of RFG. The fundamental difference between algorithms CCFSRFG-1 and CCFSRFG-2 has been indicated by line 32 in Algorithm 1 and line 26 and 33 in Algorithm 2, respectively. A JAVA-based implementation of the framework is available at GitHub. 5 In the next section, a new novel random feature grouping (RFG) is proposed with its three variants to used as the decomposer for CCFSRFG for decomposing the dimension of a dataset into lower-dimensional subdatasets.

A novel random feature grouping technique
The n 1-dimensional decomposition strategy groups a single feature into a single subcomponent. For example, if there are 9 features in a dataset, the 1-dimensional grouping will be like the following for i individuals for every generation: For each generation of an evolutionary process, the grouping will be fixed. Similarly, an m · s-dimensional dataset (m = 3 subcomponents, s = 4 features in a subcomponent) can have the following grouping with 9 features for all individuals in every generation: Similar to n 1-dimensional grouping, a m · s-dimensional grouping also maintains the same group components throughout the evolutionary process.
The proposed random feature grouping (RFG) can group features in dataset into different groups using three different ways based on the motivation from [9,14,35,39].

RFG-1
An n-dimensional dataset is decomposed randomly into m-subcomponents and the features in the subcomponents are fixed for every generation. For example, if a dataset has 9 features, the RFG-1 groups 9 features into 3 subcomponents (s = 4) features) in the first generation for i individuals the following way: This will be maintained within the same group components for all individuals and for all generations (from 0 to g), i.e., the groups will be fixed for further generations and only a single time at the start of an evolutionary process, the problem will be decomposed randomly into subproblems. Theorem 1 states the probability of grouping interacting features using RFG-1.

Theorem 1 Given I individuals, the probability of grouping v interacting features
f 1 , f 2 , . . . , f v into a single subcomponent for at least i individuals in any generation based on the probability function of more frequent grouping for function optimization problems can be defined as [14]:

where I is the number of individuals, v is the number of interacting features, s is the number of subcomponents (i.e., number of subproblems or number of subpopulations), random variable F is the number of times v interacting features are grouped into the same subcomponent, F i, and i I.
Here, the term 1  Figure 2 shows the probability of grouping interacting features randomly into a single subcomponent with the increase in the number of interacting features for any individual in any generation of an evolutionary process.
It can be observed from the graph that since in RFG-1, the features are grouped randomly only once at the beginning of an evolutionary process, the probability of grouping 9 features tends to zero irrespective of the number of individuals. However, if the number of interacting features is small, the probability of grouping is large, which is independent of the number of individuals shown in Fig. 2. The figure shows that when the number of interacting features is about 10, the probability of grouping these features into a single subcomponent is nearly zero. The probability of grouping interacting features randomly into a single subcomponent in term of the number of individuals in any generation of an evolutionary process is shown in Fig. 3. It can be seen that since the features are grouped randomly only once at the beginning of the evolutionary process, for any number of interacting variables, the probability will not be changed. For example, whether the number of individuals is 0 or 100, for v = 3 , the probability remains the same.

RFG-2
An n-dimensional dataset is decomposed randomly into m subcomponents in every generation unlike RFG-1, in case of which the dataset is decomposed randomly only once. However, similar to RFG-1, in RFG-2, all individuals will have the same features in the same subcomponents for any given generation. For example, for a dataset with 9 features, if the first generation follows the grouping of features for all individuals as illustrated in the case of RFG-1, the RFG-2 groups 9 groups into 3 subcomponents (s = 4 features) in the second generation randomly as follows: In this strategy, in each generation, feature grouping will be changed and there will be a probability for every feature to be a subcomponent with interacting features. This will also increase the selection probability of a feature into the feature subset selection. RFG-2 can ensure the grouping of any number of interacting features in a subcomponent similar to RFG-1. Therefore, Theorem 1 holds not only for RFG-1, but also for RFG-2, in terms of defining the probability of grouping interacting features.
Algorithm 4 is the pseudocode of the proposed random feature grouping technique (RFG-2).

RFG-3
An n-dimensional dataset is decomposed randomly into m subcomponents in every generation and all individuals have the same features in the various subcomponents in each generation. For example, if a dataset has 9 features, RFG-3 groups 9 groups into 3 subcomponents (s = 4) for all generations randomly and for all individuals with random features as follows: In this method, in each generation, the grouping of features will be changed for all individuals. RFG-3 can increase the probability for every feature to be a subcomponent with interacting features and can also increase the selection probability of features into subset selection. RFG-3 can ensure the grouping of any number of interacting features in a subcomponent similar to the more frequent random grouping (DECC-ML) [14], which is often used for grouping interacting variables for function optimization problems. Equation (10) also holds for RFG-3, which is adapted from [14].  Figure 4 shows the probability of grouping interacting features randomly into a single subcomponent with the increase in the number of interacting features. As seen in the case of RFG-1, from Fig. 4, it can be observed that if the number of interacting features is 10, the probability of grouping these features tends to be zero. However, with the same number of interacting features, the probability is increased for at least one cycle significantly with more individuals.

Results and discussions
To evaluate our algorithms, experiments have been performed on seven datasets, which have been collected from the publicly available UCI machine library repository. Table 3 lists the datasets used in the experiments. The seven different datasets have been used with increasing complexities. The datasets have been selected with a dimensionality between 30 and 2000 and samples between 62 and 8992. These include datasets with a low number of features, datasets with a high number of samples, and datasets with a high number of features and datasets with a low number of samples. Dataset descriptions can be found in the UCI ML Repository and Princeton University Genomics Repository.

The CC parameters
The common CC parameters used in the experiments are listed in Table 4.
The problem decomposition parameters used in the experiments are listed in Table 5.
The common GA parameters used in the experiments are listed in Table 6.
In the case of GA optimization, the binary representation of the population is used, in which a binary 1 indicates that a feature is selected and a binary 0 indicates that a feature is not selected from the dataset. Subpopulations were initialized randomly at generation 0. For CCEAFS and CCFSRFG-1, in generation 0, since there is no previous history, to evaluate an individual in a subpopulation, random collaboration has been performed     parameters for weightings w 1 and w 2 were set to 0.6 and 0.4, respectively. A maximum number of 10,000 generations were allowed as a terminating condition for the evolutionary process while verifying that there is no further improvement in the fitness value. In the case of CCEAFS, the lack of further improvement check was checked over 50 consecutive generations, and for CCFSRFG, over 100 consecutive generations.

Evaluation measures
The quality of a machine learning model is usually measured by multiple evaluation metrics. In this article, these are accuracy, sensitivity, and specificity.

Performance evaluation of classifiers with CCFSRFG
For the experiments, six widely used classifiers (NB, SVM, k-NN, J48, RF, and LR) have been used. First, the datasets have been tested with these classifiers without feature selection. Second, CCFSRFG-1 and CCFSRFG-2 have been used to reduce the number of features in the datasets. All six classifiers have been used to evaluate the performance of dimensionality reduction, i.e., feature selection for five datasets and because of higher computational resource requirement, only naïve Bayes has been applied to the QSAR Oral Toxicity and Colon Cancer Dataset. Figures 6,7,8,9,10,11,12 show the comparative performance evaluation of classifiers using FS on each dataset in terms of accuracy, sensitivity, and specificity. Simulation results from Fig. 6 on the Wisconsin Breast Cancer Dataset show that the use of RFG in CC decomposition significantly improves the performance of both variants of the CCFSRFG framework compared to the static decomposition of CCEAFS.
It can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity when using CCFSRFG compared to using CCEAFS. The highest classification accuracy of 96.49% was achieved by the LR classifier when combined with CCFSRFG-2. The specificity measure was similarly better for CCFSRFG variants than the specificity obtained using CCEAFS. With the exception of SVM + CCF-SRFG-2, the sensitivity obtained by the CCFSRFG variants was better than that of CCEAFS. While CCFSRFG-1 and CCFSRFG-2 both perform comparatively better than CCEAFS, CCFSRFG-1 is the better of the two in most cases.
The performance measures of all classifiers using FS on the Dermatology Dataset in Fig. 7 indicate that with the exception of the k-NN and J48 classifiers, all classifiers perform better with RFG variants than the static decomposition of CCEAFS.
The classification accuracy for this dataset was obtained by NB + CCFSRFG-2, which is 97.54%. With k-NN + CCFSRFG-2, a 3% reduction in accuracy was observed, and a similar reduction (about 2%) was noted with J48 + CCFSRFG-1. A few cases, for example, SVM + CCFSRFG-1 and RF + CCFSRFG-1, resulted in a 100% sensitivity. In the case of k-NN + CCFSRFG-2 and LR + CCFSRFG-2, not only sensitivity was 100%, but also specificity. According to the performance results of all classifiers with an FS on Divorce Dataset (see Fig. 8), it can be seen that in terms of accuracy, NB, SVM, k-NN, J48, and RF in collaboration with CCFSRFG-1 perform better than the other variant and CCEAFS.
In the case of LR, it equally performs good in terms of accuracy using CCEAFS and CCFSRFG-1. The highest classification accuracy achieved by most of the classifiers was almost 100%. In a few cases, for example, NB and SVM, the same accuracy was achieved by CCEAFS and CCFSRFG-2. It is clear that except k-NN + CCFSRFG-2, J48 + CCFSRFG-2, and LR + CCEAFS, a 100% sensitivity was achieved by the feature selection process. However, the specificity achieved is smaller compared to other performance measures, which is especially true for SVM + CCFSRFG-2 (65.24%). Figure 9 illustrates the performance results of all classifiers with feature selection on Musk Dataset.
From this simulation, it can be observed that the highest classification has been obtained by the RF classifier when applied with CCFSRFG-1 (97.56%). In terms of static decomposition (CCEAFS) and dynamic decomposition (CCFS with a variant of RFG), it can be seen that in most cases, the dynamic decomposition perform better than the static decomposition in all measures, except two cases in which the performance measures have not been reduced significantly (e.g., NB + CCFSRFG-2 and RF + CCFSRFG-1). Apart from classification accuracy, the sensitivity achieved by all classifiers was above 90%; however, specificity was only between 54.77 and 88.30%.
The performance evaluation results of classifiers on the Pulmon Dataset are displayed in Fig. 10.
The figure shows that a 100% classification can be achieved by SVM, k-NN, and LR classifiers when they are used with a dynamic decomposition (RFG-1 or RFG-2) technique. In all cases, CCFSRFG-1 performs better than CCEAFS except by RF where it equally performs good with CCEAFS. However, CCFSRFG-2 performs equally better only in the case of the SVM classifier. In a few cases, CCFSRFG-2 performs slightly better than CCEAFS. With the exception with J48 classifier, in all other cases, the sensitivity and the specificity obtained by each classifier were 100%.
The NB classifier performance on the QSAR Oral Toxicity Dataset in Fig. 11 shows that in terms of accuracy and sensitivity, all feature selection approaches are equally good. However, it also shows that the RFG-1 variant in the CCFS framework performs better in terms of the aforementioned measures. In contrast, the CCFSRFG-1 results in a much lower specificity compared to CCEAFS and CCFSRFG-2. The highest accuracy, sensitivity, and specificity achieved by the NB classifier on this dataset were 92.55%, 97.62%, and 58.03%, respectively. Figure 12 shows the performance of the NB classifier on the Colon Cancer Dataset in terms of accuracy, sensitivity, and specificity.
In this case, the random grouping with its two variants in CCFS framework perform equally better than the static decomposition in the CCEAFS framework. Within the CCFSRFG variants, CCFS with RFG-1 performs much better than the CCFS with RFG-2. The accuracy, sensitivity, and specificity obtained by CCFSRFG-1 were 88.71%, 82.50%, and 100%, respectively.
The experimental results for all of the datasets used in this paper are summarized in Tables 7,8,9,10,11,12,13. Table 7 shows the summary results for the Wisconsin breast cancer dataset. The table indicates that all classifiers are equally good in terms of accuracy, sensitivity, and specificity measures when they are used with CC-based FS framework with RFG-1 and RFG-2, with an exception of SVM, the classifier performs better without using feature selection. It can be observed that the feature reduction rate varies among the classifiers. For example, both CCEAFS and CCFSRFG-1 reduce features in the dataset by 93.33% with the NB, RF, and LR classifiers, and 90% with SVM and k-NN. Although J48 can reduce features in the dataset by 96.67% when combined with CCEAFS, a higher performance can be achieved when combined with CCFS and RFG variants. In most cases, CCFSRFG-2 achieves higher performance results, however, the feature reduction rate is not lower compared to the other variant and CCEAFS.
The summary of results for the Dermatology Dataset is listed in Table 8. From the table, it can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity, regardless whether they are used with or without CCFS. However, when the classifiers are used with CCFS, a feature reduction rate between 70.59 and 85.29% can be achieved. CCFSRFG-1 reduces the number of features by 1 to 3 features; however, CCFSRFG-2 cannot reduce the number of feature significantly to CCEAFS. From the summary results of the Divorce Dataset in Table 9, it can be observed that all classifiers result in a higher performance in terms of accuracy, sensitivity, and specificity when used with CCFSRFG-1, and the number of features is reduced to only 2, except for SVM, where the number of features is 3. The rate of feature reduction varies between 75.93 and 96.30%.
The summary of results for the Must Dataset in Table 10 indicates that in terms of accuracy, sensitivity, and specificity, NB, k-NN, and J48 can achieve the highest classification accuracy with a much reduced number of features when combined with CCFSRFG-1.
The other classifiers can also reduce the number of features to a very low level without significantly reducing classification accuracy and other measures. The maximum number of features can be reduced by RF + CCFSRFG-1 (from 166 to only 7) with a reduction rate of 95.78%. Overall, the feature reduction rate by the classifiers varies from 87.95 to 95.78%, when they are used with a feature selection process with RFG-1, which clearly indicates superior performance over CCEAFS approaches. While the number of feature reduction by CCFSRFG-2 lies between 56 to 65, classification accuracy is higher than CCEAFS in most cases, although at the cost of more number of features for CCFSRFG-2 compared to CCEAFS. When the dataset has a large number of features and a low number of samples, as in the case of Pulmon Dataset, it can be seen from the summary results in Table 11 that CCFSRFG-1 can reduce the number of features much more than CCEAFS.
k-NN + CCFSRFG-1 can reduce the number of features from 432 to only 25 (feature reduction rate of 94.21%) with a 100% accuracy. This accuracy can also be achieved by SVM + CCFSRFG-1, SVM + CCFSRFG-2, and LR + CCFSRFG-1; however, the feature reduction rate is less than that of k-NN + CCFSRFG-1. With the exception by J48, all classifiers achieved a 100% sensitivity and specificity.
With a large number of features (1024) in QSAR Oral Toxicity Dataset, it can be seen from the summary results in Table 12 that the NB classifier can reduce the number of features substantially (to only 60) when NB is combined with CCFSRFG-1.
With this framework, NB also can obtain a higher classification rate of 92.55% compared to the 79.78% of NB without using any feature selection. NB can reduce the number of features to a very lower number whether used with a CC-based FS framework using static decomposition method or with a dynamic decomposition method, such as RFG. However, it can be clearly predicted that NB performs very well when it is used with CCFSRFG-1, compared to other feature selection combinations. The summary results of the Colon Cancer Dataset are recorded in Table 13. Similar to the QSAR Oral Toxicity Dataset, the Colon Cancer Dataset also has a large number of features. This is why NB + CCFSRFG-1 performs much better in terms of accuracy, sensitivity, and specificity than NB + CCEAFS or NB + CCFSRFG-2. A 96.35% feature reduction can be observed for NB + CCFSRFG-1 on the Colon Cancer Dataset.
According to the experiments, when the dimensionality of a dataset is reduced using feature selection, the performance measures in terms of accuracy, sensitivity, and specificity are not degraded in a significant rate. In a few cases, it can be observed that the classifiers perform better with feature selection than without it. Because of the search algorithm, the selected subset of features is not unique for every classifier. If the subset of features evaluated by two different classifiers is the same, the underlying search algorithm returns the same subset of features with the two independent executions. The classifiers are only used to evaluate the selected subset of features, thereby improving classification accuracy in each generation. It can also be seen that random feature grouping (RFG) introduced in this paper significantly improves classification performance. As studied in the literature, decomposition strategy plays an important role in the performance of a CC-based framework. The experimental results obtained here confirm the role of dynamic decomposition methods. In this paper, three variants of RFG have been proposed, the first two of which tested using all seven datasets. The second variant, RFG-2, which allows a dataset to be decomposed in every generation of an evolutionary   Although RFG-2 achieves a higher classification performance in a few cases this comes at a cost of more features than with RFG-1 or CCEAFS. RFG-2 could not perform better than RFG-1 and static decomposition because of the collaboration technique used (i.e., an individual from a subpopulation with the collaboration of individuals from the rest of each subpopulation) for the problem. In the case of CCEAFS and CCFSRFG-1, in the very first generation (where there is no previous information available), a random collaboration technique can be used to perform the collaboration. From the second generation onwards, the best individuals at the time can be used from the previous generation to perform the collaboration for all individuals. However, in the case of CCFSRFG-2, for every generation, only a random collaboration technique is used to perform the collaboration in every generation. The best individuals at the time cannot be used from the previous generation as collaborators for subsequent generations for CCFSRFG-2 because of the random group components in every generation. Random group components (i.e., the features in a subpopulation) are usually different from the group components of previous generations. Hence, if the best individual from a previous generation collaborates with an individual in the current generation, the same features might collaborate twice or multiple times, which results in a contradictory situation, because the features represented as both 0 and 1 would indicate not being selected and being selected at the same time in the subset. Thus, when the same feature is included in collaborating individuals twice or multiple times with the same or different representation, a formation of a complete solution is not possible. This concludes that beyond an appropriate decomposition method, a suitable collaboration technique is also needed for a CC-based FS framework, which is yet to be investigated. Alhough RFG-3 was proposed as a third variant of random feature grouping this also relies on a proper collaboration technique. Feature selection is an essential preprocessing step in various Big Data analytics, for example, anomaly detection for cybersecurity data. Cybersecurity data analysis is very important to indicate vulnerabilities and unveil security breaches, such as via detecting network inconsistencies. Examples of applying feature selection before anomaly detection include a hierarchical feature selection for DDoS mitigation [71], an ensemble feature selection for intrusion detection [72], and clustering and correlation-based feature selection for intrusion detection [73]. Hence, our proposed cooperative co-evolutionbased feature selection with the proposed random feature grouping (CCFSRFG) can be applied in any domain of Big Data.

Conclusion and future work
Following an extensive literature review on problem decomposition approaches using CC, this paper investigated the application of CC with a dynamic decomposition for feature selection. It proposed a random feature grouping algorithm for feature selection using CC, and evaluated its performance of six ML classifiers on seven datasets. The experiments indicated that the feature selection process does not degrade the performance of the classifiers significantly. We also analyzed the effect of feature selection on different datasets, covering datasets with many samples and few features, as well as datasets with few samples and many features.
The comparative analysis of the performance results of the classifiers in terms of accuracy, sensitivity, and specificity validate the effectiveness of the proposed CC-based feature selection algorithm (CCFSRFG) with the novel dynamic decomposition technique, random feature grouping (RFG), for feature selection problems. In most cases, the proposed FS approach with RFG-1 results in a better performance than the static decomposition approach (CCEAFS) and when using without feature selection. Although a suitable problem decomposition technique for feature selection may result in better performance, a suitable decomposition strategy by itself is not sufficient to achieve the full potential of CC-based approaches. CC-based approaches also require an appropriate optimizer to evolve each subpopulation and a proper collaboration technique to build a complete solution. For example, CCFSRFG-2 may perform better with efficient feature reduction when using other collaboration model. Therefore, as future work, a proper collaboration model will be investigated for the proposed CCFSRFG approach. In addition, CCFSRFG will be combined with other optimizers.