 Research
 Open Access
 Published:
An enhanced random forest approach using CoClust clustering: MIMIC-III and SMS spam collection application
Journal of Big Data volume 10, Article number: 38 (2023)
Abstract
The random forest algorithm can be enhanced and produce better results with a well-designed and organized feature selection phase. The dependency structure between the variables is the most important criterion behind selecting the variables to be used in the algorithm during the feature selection phase. As the dependency structure is mostly nonlinear, a tool that accounts for nonlinearity is the more beneficial choice. The Copula-Based Clustering technique (CoClust) clusters variables with copulas according to nonlinear dependency. We show that a remarkable improvement in CPU time and accuracy can be achieved by adding a CoClust-based feature selection step to the random forest technique. We work with two different large datasets, namely, the MIMIC-III Sepsis Dataset and the SMS Spam Collection Dataset. The first dataset is large in terms of rows, each referring to an individual ID, while the latter is an example of data with a long column dimension, i.e., many variables to be considered. In the proposed approach, first, random forest is employed without adding the CoClust step. Then, random forest is repeated on the clusters obtained with CoClust. The results are compared in terms of CPU time, accuracy and the ROC (receiver operating characteristic) curve. CoClust clustering results are compared with the K-means and hierarchical clustering techniques. The Random Forest, Gradient Boosting and Logistic Regression results obtained with these clusters, and the success of RF and CoClust working together, are examined.
Introduction
The random forest technique is an effective and popular method for solving classification and regression problems based on decision trees. It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Random forest (RF) has been used in biology and medicine, for example on high-dimensional genetic or tissue microarray data and on MIMIC-III [1,2,3,4,5,6]. Because of its simplicity, it is devised to operate quickly and efficiently over large datasets, and it offers among the highest prediction accuracies of classification models.
The main contribution of this study is to increase the speed and accuracy of RF by adding a new feature selection step. Especially when working with big data, increasing speed and accuracy through a correct clustering method is very important. Correct determination of the dependency between variables in the feature selection step is one of the most critical steps of the study. Although linear dependence is commonly assumed in such studies, nonlinear dependence is also frequently encountered, and the efficient operation of the clustering method under nonlinear dependence is a side benefit of this work. One of the popular tools for analyzing nonlinear dependencies is the copula. The main advantage of the proposed approach using CoClust is that it achieves high accuracy on big data in a short time.
The Copula-Based Clustering technique (CoClust), which examines dependencies using copulas, is an alternative to classical clustering techniques and overcomes linear dependency constraints. In this technique, the power and type of multivariate dependency between sets are modeled with a copula function and its dependency parameter.
In the feature selection step, the determination of nonlinear dependency is emphasized, and copulas are preferred. CoClust gives effective results by clustering variables that show nonlinear dependency using copulas. We mainly work on the feature selection phase employing CoClust rather than regular feature selection methods and show that high efficiency in terms of CPU time and prediction is obtained from this version of RF because CoClust implies the noninclusion of the uncorrelated variables in clusters.
The data-oriented purpose of our work lies in the use of a more efficient prediction model for mortality prediction and spam SMS classification through copulas and CoClust. The study also aims to develop a different approach for mortality prediction in intensive care patients and for spam SMS classification by examining the nonlinear dependency structure between variables through copulas.
The method proposed for Random Forest is also applied to other classification techniques, such as Gradient Boosting and Logistic Regression, and the results are evaluated with respect to both other clustering methods and other machine learning methods.
Another important aspect of the study is that it works with two large datasets, MIMIC-III (Medical Information Mart for Intensive Care) and SMS Spam Collection. MIMIC-III is a large, freely accessible database including more than forty thousand patients who were treated at the intensive care units of Beth Israel Deaconess Medical Center between 2001 and 2012. MIMIC-III, the latest version of MIMIC, includes the hospital records of 46,520 patients, of whom 38,645 are adults and 7,875 are newborns.
Examining the proposed method on a dataset with a large number of variables is another important step of the study. In this context, the SMS Spam Collection dataset is used, which helps short message services classify messages as spam. While the number of text messages is 5,574, the number of variables in this dataset is remarkable: it consists of 770 variables. Testing the proposed feature selection step on a dataset with many variables broadens the basis for comparison.
In this context, a literature review of the techniques used is clarified in “Literature review” Section. CoClust, RF and the proposed approach for RF are explained in “The proposed approach for random forest” Section. “Datasets” Section presents the experimentation of data sampling. In “Application” Section, the results obtained by applying the proposed approach are presented. “Discussion” Section and “Conclusion” Section focus on the discussion and conclusion of the application.
Literature review
RF is a flexible, easy-to-use machine learning algorithm that often produces great results and mainly builds on the celebrated method of classification and regression trees (CART). Breiman [7] provided an early example of bagging with random selection to grow each tree without replacement. Dietterich [8] and Ho [9] make use of random subspace and random split selection.
Breiman [10] uses new training sets obtained by randomizing the outputs in the original training set. He defines RF as a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. He also notes that some or all of the input variables may be categorical, and since additive combinations of variables are desired, it is necessary to define how categorical variables are treated so they can be combined with numerical variables [11].
Mistry et al. [5] draw attention to classifiers that allow us to predict which tool will be most suitable for reducing the toxicity of a drug. They demonstrate the use of data mining and machine learning techniques by examining models using RF and decision trees. Accordingly, an accuracy of 80% is obtained from the RF models. Thus, RF gives efficient results in the field of health.
The use of RF in mortality predictions also has an important place in the literature. Levantesi and Nigri [3] propose a novel approach based on the combination of RF and a two-dimensional P-spline. The two-dimensional P-spline is used to smooth and project the RF estimator in the forecasting phase. All the analyses were carried out on data from the Human Mortality Database and considering the Lee–Carter model.
RF can be used in biology and medicine, such as on high-dimensional genetic or tissue microarray data [12, 13]. The RF technique has also been studied on MIMIC, an important database, producing remarkable studies for both RF and the MIMIC databases. Poucke et al. [6] concentrate on quantitative analysis of the predictive power of laboratory tests and early detection of mortality risk by using predictive models and feature selection techniques in the MIMIC-III database. RF and logistic regression were used on patients with renal failure admitted to ICUs at Boston’s Beth Israel Deaconess Medical Center.
McWilliams et al. [4] aim to develop an automated method for detecting patients who are ready for discharge from intensive care. Two cohorts derived from the GICU and MIMIC-III were analyzed with RF and a logistic classifier.
Another important step regarding RF is the feature selection phase. Although RF inherently enables feature selection, using different techniques in feature selection sheds light on the RF technique and literature.
Hapfelmeier and Ulm [14] note that feature selection has been suggested for RF to improve data prediction and interpretation. Three approaches to selecting variables, i.e., multiple imputation, complete case analysis and the application of a self-contained measure, are applied to half of the data. In the rest of their study, unbiased RF is preferred.
Uddin and Uddin [15] propose a feature selection method based on guided RF. The guided RF is used to select a small set of important variables. First, an ordinary RF is trained on the dataset to collect the feature importance scores, and then, the collected importance scores are injected to influence the feature selection process in the guided RF.
Gupta [16] uses three approaches (wrappers, filters, embedded methods) for feature selection, and then four machine learning models are used to solve classification problems. RF is one of these methods. The highest accuracy of 56.99% is achieved with the RF model.
Copulas were first introduced by Abe Sklar [17]. Sklar’s theorem elucidates the role that copulas play in the relationship between multivariate distribution functions and their univariate margins [18]. It states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula that describes the dependence structure between the variables [19].
Mesiar and Sheikhi [20] emphasize the importance of nonlinear dependence in their work and offer a solution to the problem through copulas. In their study, the simulated data are obtained through copulas and each observation is placed in a correlated cluster. In CoClust, by contrast, a variable that is not related is excluded from the clusters, which is one of its most important distinguishing features.
Although copulas are used in many areas, CoClust was introduced by Di Lascio [21] in a doctoral dissertation, and the technique was developed further on clinical microarray data analysis [22]. In 2017, the dietary development of forty European countries with respect to healthy nutrition rules was examined with CoClust, and an improved version of the technique was presented in 2019 (Di Lascio, Durante, and Pappada [23]; Di Lascio and Giannerini [24]).
CoClust, based on copula functions, allows clustering of observations according to the multivariate dependency structure without any assumptions on the marginals. The basic idea behind CoClust is that it partitions the rows of the data matrix into K groups at once, through a forward procedure that selects a p-dimensional vector for each cluster (Di Lascio [21]).
Di Lascio [21] also compared CoClust with another wellknown clustering technique based on probability models and found that the latter is not able to model the true dependence relationship between observations.
There are also studies in the literature in which copulas and decision trees are used together. Khan et al. [25] take a joint approach to copulas and decision trees. They propose a novel nonparametric copula-based decision tree scheme using a measure of dependence and apply it to credit card records from Taiwan and coronary heart disease records from Pakistan, obtaining the desired outcomes.
Eling and Toplek [26] and Mesiar and Sheikhi [20] emphasize the importance of nonlinear dependence in their studies and offer solutions through copulas. In the present study, a solution to this problem is proposed by using the nonlinear dependency capability of CoClust.
Zhu et al. [27] aim to establish prediction scores for mechanically ventilated patients in the ICU, using the machine learning methods of k-nearest neighbors, logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting, and neural network for model establishment. The efficiency of the resulting models is measured via the AUC, and an AUC of 0.819 is reached with RF.
Khope and Elias [28] examine the MIMIC-III dataset with KNN, LR and ANN and compare the results obtained via the confusion matrix and accuracy.
Based on the literature review, it is decided to use the CoClust and RF techniques together. Thus, by applying the feature selection step with CoClust, it is possible to pursue the goal of achieving higher accuracy in a shorter time. The methods mentioned in the literature review are explained in “The proposed approach for random forest” Section.
The proposed approach for random forest
CoClust brings a different perspective to the literature by using copulas in the clustering technique. In this study, a novel approach is proposed by adding CoClust to RF as a feature selection step. In the proposed approach, clusters are formed by considering the dependency between variables with CoClust, and then the most efficient model is obtained with RF by using the relevant variables. Thus, it aims to bring a different approach to the feature selection phase of RF. These techniques utilized in the proposed method are explained in this section.
CoClust
CoClust was introduced by Di Lascio in [21] through her doctoral thesis, developed in 2017, and the final version of the technique was presented by Di Lascio and Giannerini [24].
CoClust includes copula families in the clustering algorithm. It clusters multivariate dependent variables based on the copula likelihood function. CoClust assumes that the data are derived from a multivariate copula function, with each cluster represented by a marginal density function. The power and type of multivariate dependency between clusters are modeled by the copula function and the dependency parameter of the copula, respectively.
The copula function is defined as “functions that join or couple multivariate distribution functions to their one-dimensional marginal distribution functions” by Nelsen [18].
The copula function was first treated by Abe Sklar in [17] as a function that links univariate marginals to multivariate distributions within the scope of probabilistic metric spaces [29].
Consider for a moment a pair of random variables X and Y, with distribution functions F(x) = P(X ≤ x) and G(y) = P(Y ≤ y), respectively, and a joint distribution function H(x, y) = P(X ≤ x, Y ≤ y). For each pair of real numbers (x, y), we can associate three numbers: F(x), G(y), and H(x, y). Each of these numbers lies in the interval [0,1]. In other words, each pair (x, y) of real numbers leads to a point (F(x), G(y)) in the unit square [0,1] × [0,1], and this ordered pair in turn corresponds to a number H(x, y) in [0,1]. This correspondence, which assigns the value of the joint distribution function to each ordered pair of values of the individual distribution functions, is indeed a function. Such functions are copulas [18].
Let H be a joint distribution function with margins F and G. Then there exists a copula C, given in Eq. 1, for all x, y ∈ \(\overline{R}\) [18].
$$H\left( {x,y} \right) = C\left( {F\left( x \right),G\left( y \right)} \right)$$(1)
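For readers who prefer a computational check, the relation H(x, y) = C(F(x), G(y)) can be verified numerically in the bivariate Gaussian case, where the copula of a bivariate normal distribution is the Gaussian copula. The sketch below is an illustration added here (it is not part of the paper's application) and assumes SciPy is available.

```python
from scipy.stats import norm, multivariate_normal

# Bivariate normal H with correlation rho; its margins F, G are standard
# normal, and its copula is the Gaussian copula.
rho = 0.6
biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def gaussian_copula(u, v):
    # C(u, v) = Phi_2(Phi^{-1}(u), Phi^{-1}(v); rho)
    return biv.cdf([norm.ppf(u), norm.ppf(v)])

x, y = 0.3, -0.8
H = biv.cdf([x, y])                            # joint distribution H(x, y)
C = gaussian_copula(norm.cdf(x), norm.cdf(y))  # C(F(x), G(y))
# Sklar's theorem: the two quantities agree up to numerical tolerance.
```

Evaluating both sides at any point (x, y) gives the same number, which is exactly the correspondence described above.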
According to Sklar’s theorem, any joint probability function f(·) can be split into the margins and a copula. For continuous random variables, the copula density c(·) is related to the density f(·) of the distribution F(·) through the well-known canonical representation, presented in Eq. 2 (Di Lascio, Durante, and Pappada 2017).
$$f\left( {x_{1} , \ldots ,x_{p} } \right) = c\left( {F_{1} \left( {x_{1} } \right), \ldots ,F_{p} \left( {x_{p} } \right)} \right)\mathop \prod \limits_{i = 1}^{p} f_{i} \left( {x_{i} } \right)$$(2)
Such a separation underlies the modeling flexibility offered by copulas, since the estimation problem can be decomposed into two steps: in the first step, the margins are estimated; in the second step, the copula model is estimated. The most commonly used estimation method is the two-stage inference for margins method [30], which employs log-likelihood estimation for both the parameter(s) of each margin and the copula parameter θ. This method can be used in a semiparametric approach (Genest, Ghoudi, and Rivest [31]) that does not require distributional assumptions on the margins. The log-likelihood copula function used to estimate θ is given in Eq. 3 (Di Lascio, Durante, and Pappada 2017).
$$\ell \left( \theta \right) = \mathop \sum \limits_{j = 1}^{n} \log c\left( {F_{1} \left( {x_{1j} } \right), \ldots ,F_{p} \left( {x_{pj} } \right);\theta } \right)$$(3)
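The semiparametric two-step procedure can be sketched in code: margins are replaced by pseudo-observations (rescaled ranks), and the copula parameter θ is then estimated by maximizing the copula log-likelihood. The example below is illustrative only (it is not the paper's implementation); it uses a bivariate Clayton copula, whose density has a closed form, and recovers θ from simulated data.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import rankdata

rng = np.random.default_rng(0)
theta_true = 2.0
n = 2000

# Sample (u, v) from a Clayton copula by conditional inversion.
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = ((w ** (-theta_true / (1.0 + theta_true)) - 1.0)
     * u ** (-theta_true) + 1.0) ** (-1.0 / theta_true)

# Step 1 (semiparametric): pseudo-observations, ranks / (n + 1).
pu = rankdata(u) / (n + 1.0)
pv = rankdata(v) / (n + 1.0)

def neg_loglik(theta):
    # Clayton density: c(u,v) = (1+theta)(uv)^{-(1+theta)}
    #                           (u^{-theta}+v^{-theta}-1)^{-(2+1/theta)}
    lc = (np.log1p(theta)
          - (1.0 + theta) * (np.log(pu) + np.log(pv))
          - (2.0 + 1.0 / theta) * np.log(pu ** (-theta) + pv ** (-theta) - 1.0))
    return -lc.sum()

# Step 2: maximize the log-likelihood copula function over theta.
res = minimize_scalar(neg_loglik, bounds=(0.01, 20.0), method="bounded")
theta_hat = res.x
```

With n = 2000 observations, the estimate lands close to the true value θ = 2, illustrating how the copula parameter is fitted independently of the marginal distributions.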
The concept of CoClust refers to the aggregation of multivariate dependent variables based on the log-likelihood function of the copula model. To realize this clustering, CoClust assumes that the data are derived from a multivariate copula function, which represents the clusters, and each cluster is known to be represented by a univariate density function. The power and type of multivariate dependency between clusters are modeled by the copula function and the dependency parameter of the copula, respectively.
The algorithm starts from an (n × p) data matrix X, expressed in Eq. 4.
$$X = \left( {x_{ij} } \right)_{i = 1, \ldots ,n;\;j = 1, \ldots ,p}$$(4)
The purpose of clustering is to group the (n × p)-dimensional dataset into K clusters.
Values in a row (or column) vector are independent realizations from the same density function, so the observations in each cluster come from the same distribution. Here, the algorithm is described as applied to the rows of the data matrix (Di Lascio [21]).
The main steps of the CoClust algorithm required for clustering the n row data matrix are explained as follows (Di Lascio and Giannerini [24]).

1.
for k = 2, …, K_{max}, where K_{max} ≤ n is the maximum number of clusters to be tried:

a.
select a subset of n_{k} k-plets of rows/profiles in the data matrix on the basis of the following multivariate measure of association based on pairwise Spearman’s ρ correlation coefficient in Eq. 5.
$$H\left( {\Lambda_{2} \mid \Lambda_{1} } \right) = \mathop {\max }\limits_{{i^{^{\prime}} \in \Lambda_{2} }} \left\{ {\mathop \psi \limits_{{i \in \Lambda_{1} }} \left( {\rho \left( {x_{i} ,x_{i^{\prime}} } \right)} \right)} \right\}$$(5)In Eq. 5, Λ is the set of row index profiles such that Λ = Λ_{1} ∪ Λ_{2}, Λ_{1} is the subset of profiles already selected to compose a k-plet, Λ_{2} is the set of remaining candidates to complete a k-plet, x_{i} is the ith profile, and ψ is a selected function among the mean, the median or the maximum;

b.
fit the copula model on the n_{k} k-plets of profiles/rows through maximum pseudo-likelihood estimation.


2.
select the subset of n_{k} k-plets of rows/profiles, say n_{K} K-plets, that maximizes the log-likelihood copula function; hence, the number of clusters K, i.e., the dimension of the copula, is automatically chosen;

3.
select a K-plet using the measure in Eq. (5) and estimate K! copulas by using the observations already clustered and a permutation of the candidates for allocation;

4.
allocate the permutation of the selected K-plet to the clustering by assigning each observation to the corresponding cluster if it increases the log-likelihood of the copula fit; otherwise, drop the entire K-plet of rows/profiles;

5.
repeat steps 3 and 4 until all the observations are evaluated (either allocated or discarded).
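The candidate-selection part of the algorithm (step 1a) can be sketched in code. The following is a simplified, illustrative greedy construction of one k-plet using the Spearman-based measure of Eq. 5: it seeds with the most strongly associated pair of rows and then repeatedly adds the candidate maximizing the association with the rows already selected. It uses the mean of absolute pairwise Spearman correlations as ψ(ρ) (a simplification) and omits the copula fitting that the real CoClust algorithm performs at each step.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

def association(candidate, selected_rows, X, psi=np.mean):
    # psi over pairwise Spearman's rho between one candidate row and the
    # rows already selected for the k-plet (inner part of Eq. 5).
    return psi([abs(spearmanr(X[i], X[candidate]).correlation)
                for i in selected_rows])

def greedy_kplet(X, k, psi=np.mean):
    # Seed with the most strongly associated pair of rows, then grow the
    # k-plet by maximizing H(Lambda_2 | Lambda_1) over the candidates.
    n = X.shape[0]
    best = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: abs(spearmanr(X[p[0]], X[p[1]]).correlation))
    selected = list(best)
    while len(selected) < k:
        cands = [i for i in range(n) if i not in selected]
        nxt = max(cands, key=lambda c: association(c, selected, X, psi))
        selected.append(nxt)
    return selected

X = rng.normal(size=(8, 30))
X[5] = X[0] + 0.1 * rng.normal(size=30)   # row 5 depends on row 0
X[6] = -X[0] + 0.1 * rng.normal(size=30)  # row 6 depends on row 0 too
kplet = greedy_kplet(X, k=3)              # recovers the dependent rows
```

On this toy matrix, the three mutually dependent rows (0, 5 and 6) are the ones collected into the k-plet, while the unrelated rows are left out, mirroring CoClust's ability to discard irrelevant observations.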
Since non-nested models are tested at every step of the algorithm, that is, the copula models compared use a univariate dependency parameter, the defined log-likelihood-based criterion is equivalent to the Bayesian information criterion and the Akaike information criterion (Di Lascio [21]).
The Bayesian information criterion for the K-dimensional copula model m is defined in Eq. 6 (Di Lascio [21]), where q_{m} is the number of estimated parameters and n is the number of observations.
$$BIC_{m} = - 2\ell_{m} \left( {\hat{\theta }} \right) + q_{m} \log \left( n \right)$$(6)
Accordingly, the copula model that minimizes the BIC value is selected. Similarly, the Akaike information criterion (AIC), expressed in Eq. 7, can be used to select the copula model (Di Lascio [21]).
$$AIC_{m} = - 2\ell_{m} \left( {\hat{\theta }} \right) + 2q_{m}$$(7)
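Both criteria are straightforward to compute once the maximized copula log-likelihood is available. The log-likelihood values below are hypothetical, used only to illustrate the model-selection step.

```python
import numpy as np

def copula_bic(loglik, n_params, n_obs):
    # Eq. 6: BIC = -2 * log-likelihood + q * log(n); smaller is better.
    return -2.0 * loglik + n_params * np.log(n_obs)

def copula_aic(loglik, n_params):
    # Eq. 7: AIC = -2 * log-likelihood + 2 * q; smaller is better.
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical fitted log-likelihoods (loglik, n_params) for two candidates.
models = {"clayton": (120.5, 1), "gaussian": (118.0, 1)}
n_obs = 400
best = min(models, key=lambda m: copula_bic(models[m][0], models[m][1], n_obs))
```

With the same number of parameters, the model with the higher log-likelihood wins under both criteria, as expected.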
The technique yields clusters, each containing at most (n/K)·p independent observations. The configuration of multivariate relationships here is not based on the intra-cluster relationships of classical clustering methods. Although each cluster consists of independent and identically distributed observations obtained from the same marginal distribution, observations across clusters share the same multivariate dependency structure (Di Lascio [21]). Thus, each cluster is generated by a (marginal) univariate density function, and the interpretation of the clustering is based on within-group independence and among-group dependence (Di Lascio, Durante, and Pappada [32]). In classical clustering methods, elements that are correlated with each other are in the same cluster and are expressed in this way in tables; in CoClust tables, the situation is the opposite. Di Lascio and Disegna [33] explain that CoClust aims to describe within-cluster independence and between-cluster dependence instead of the within-cluster homogeneity and between-cluster separation of more traditional clustering approaches. Therefore, in order not to confuse the reader, the clusters in the tables are presented as in the classical methods.
The most important advantage of the technique is that there is no need to set a priori the exact number of clusters K, nor is a starting classification required because the algorithm automatically selects the best number of clusters K within a given range of possibilities on the basis of the loglikelihood in Eq. 3 (Di Lascio and Giannerini [23]).
The other important feature of this technique is that it clusters only the variables that it identifies to be related, which means not all variables present are placed in clusters. Variables regarded as uncorrelated are kept outside of the clusters. In this respect, it differs from the tail dependency technique.
Many different copula models are available in the literature, but Nelsen [18] showed that the elliptical and Archimedean families are the most useful in empirical modeling. The elliptical family includes the Gaussian copula and the t-copula. Both are symmetric, and they can account for both positive and negative dependence since −1 ≤ θ ≤ 1. The Archimedean family, through Clayton’s, Gumbel’s and Frank’s models, can describe left and right asymmetry as well as weak symmetry among the margins. Clayton’s copula has parameter θ ∈ (0, ∞), and as θ approaches zero, the margins become independent. The dependence parameter θ of a Gumbel model is restricted to the interval [1, +∞), where the value 1 means independence. Finally, the dependence parameter θ of a Frank copula may assume any real value, and as θ approaches zero, the marginal distributions become independent (Di Lascio, Durante, and Pappada [32]).
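The three Archimedean models mentioned above have closed-form expressions, so their independence limits can be checked directly. The sketch below (formulas following Nelsen [18], added here for illustration) evaluates each copula and verifies that it collapses to the independence copula C(u, v) = u·v at the stated parameter limits.

```python
import numpy as np

def clayton(u, v, theta):
    # theta in (0, inf); theta -> 0 gives independence.
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def gumbel(u, v, theta):
    # theta in [1, inf); theta = 1 gives independence exactly.
    return np.exp(-(((-np.log(u)) ** theta
                     + (-np.log(v)) ** theta) ** (1.0 / theta)))

def frank(u, v, theta):
    # theta any nonzero real; theta -> 0 gives independence.
    num = np.expm1(-theta * u) * np.expm1(-theta * v)
    return -np.log1p(num / np.expm1(-theta)) / theta

u, v = 0.4, 0.7
g_ind = gumbel(u, v, 1.0)      # exactly u * v
c_ind = clayton(u, v, 1e-8)    # approaches u * v as theta -> 0
f_ind = frank(u, v, 1e-8)      # approaches u * v as theta -> 0
```

At θ = 1 the Gumbel copula reproduces u·v exactly, while Clayton and Frank approach it as θ → 0, matching the parameter ranges given above.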
Di Lascio [21] tests the CoClust algorithm on simulated data drawn from Gaussian and Frank copulas in different situations and dependence settings. They found that the algorithm is able to recover the true underlying dependence relationship between observations grouped in different clusters irrespective of the kind of margins, the value of the dependence parameter and the copula model.
The CoClust algorithm has been successfully applied to various datasets. In a biomedical setting, Di Lascio et al. [32] determined the type of organ from tumors and cancer cell lines, and Di Lascio and Giannerini [24] applied the algorithm to formulate hypotheses about possible functional relationships between genes. Applications in other fields include analyzing changes in EU country diets under the guidance of healthy diets and common European policies and investigating the geographic distribution of precipitation measurements [32].
Random forest
RF is a technique based on decision trees that uses rules to split data in a binary manner (Ji, Yang, and Tang [34]). In the literature, when solving classification problems, the Gini index, deviance and the twoing rule are used for the best split [35].
Breiman [11] defines RF as a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Decision trees come together to form an RF, and each tree is built on a randomly selected subset of the dataset.
An RF is an ensemble classifier consisting of many decision trees, where the final predicted class for a test example is obtained by combining the predictions of all individual trees [11]. Each node is partitioned based on a single feature, and each branch ends in a terminal node, which provides a prediction for the class of a test example based on the path taken through the tree [36].
In other words, many classification and regression trees are generated and then the results are aggregated. Each tree is independently constructed using a bagging sample of the training data (Ji, Yang, and Tang [34]).
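This construction is directly available in common libraries. The scikit-learn example below (synthetic data, illustrative only, not the paper's datasets) builds a forest in which every tree is grown on a bootstrap (bagging) sample of the training set and the predictions are aggregated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Each of the 500 trees is grown on a bootstrap sample of X_tr; the
# forest's prediction aggregates the individual tree predictions.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)   # held-out accuracy
```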
Additionally, the technique is not affected by the interactions of correlated variables because each tree comprises random samples [37].
In the training phase, X represents the objects in the training dataset (an N × M matrix, where N is the number of training examples and M is the number of variables); L represents the labels of the training set (an N × 1 matrix); n_{tree} represents the number of trees in the forest; θ_{k} represents each random tree in the forest (k = 1, 2, …, n_{tree}); and M_{try} represents the number of features randomly selected for splitting (Ji, Yang, and Tang [34]).
RF is an integrated classifier composed of multiple decision tree classifiers, described in Eq. 8 as the majority vote over the individual trees, where h_{k} is the kth tree and I(·) is the indicator function.
$$H\left( x \right) = \arg \mathop {\max }\limits_{Y} \mathop \sum \limits_{k = 1}^{{n_{tree} }} I\left( {h_{k} \left( x \right) = Y} \right)$$(8)
At the end of the algorithm, the predictive capability of the RF model should be assessed. Various statistical parameters or crossvalidation procedures are used to validate the performance of the proposed models [38, 39].
The RF method has two important products: out-of-bag (OOB) estimates of the generalization error and variable importance measures [11]. The two algorithms for calculating variable importance differ somewhat from the four heuristics originally suggested [11]. The first is based on the Gini criterion: Gini impurity represents the probability that a randomly selected sample from a node will be incorrectly classified according to the distribution of samples in the node [40]. The second calculates variable importance as the mean decrease in accuracy using the OOB observations [40, 11].
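Both products are exposed by common implementations. In the illustrative scikit-learn sketch below (synthetic data), `oob_score_` gives the out-of-bag estimate of generalization accuracy and `feature_importances_` gives the Gini-based importances; with `shuffle=False` the informative features are the first three columns, so they should dominate the ranking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False and n_redundant=0, columns 0-2 are the informative
# features and columns 3-9 are pure noise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)

oob = rf.oob_score_                 # out-of-bag generalization estimate
gini_imp = rf.feature_importances_  # mean decrease in Gini impurity
top3 = set(np.argsort(gini_imp)[-3:])  # indices of the 3 most important
```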
To evaluate the classification ability and the performance of the model, parameters such as error (Er) and accuracy (Ac) are calculated, given in Eqs. 9 and 10. In the equations, TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively; the relationship of these four factors is best shown by the confusion matrix [41].
$$Er = \frac{FP + FN}{{TP + FP + TN + FN}}$$(9)
$$Ac = \frac{TP + TN}{{TP + FP + TN + FN}}$$(10)
Note that sensitivity is also called the true positive rate, defined as the ability and proportion of a classifier to correctly predict positively labeled molecules, while specificity is also called the true negative rate, defined as the capability and percentage of negatively labeled instances identified as negative [39, 42, 43]. Accuracy is the percentage coverage of correct predictions, generally applied to judge the predictive power of models [44].
Other metrics, such as recall and F1, are also calculated from the confusion matrix. The recall score represents the model’s ability to correctly predict the positives out of the actual positives; it is also known as sensitivity or the true positive rate and is given in Eq. 11. The F1 score combines precision and recall, giving equal weight to both, which makes it an alternative to the accuracy metric; it is given in Eq. 12.
$$Recall = \frac{TP}{{TP + FN}}$$(11)
$$F1 = \frac{{2 \times Precision \times Recall}}{{Precision + Recall}}$$(12)
Accuracy is a classification performance metric defined as the ratio of true positives and true negatives to all positive and negative observations. In other words, accuracy tells us how often the model can be expected to predict an outcome correctly out of the total number of predictions it made. For this reason, the accuracy criterion is preferred in the interpretation of the models.
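All of these quantities follow directly from the confusion-matrix counts. The small helper below computes Eqs. 9-12 from hypothetical counts chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Error, accuracy, recall and F1 from confusion-matrix counts
    (Eqs. 9-12)."""
    total = tp + fp + tn + fn
    error = (fp + fn) / total            # Eq. 9
    accuracy = (tp + tn) / total         # Eq. 10
    recall = tp / (tp + fn)              # Eq. 11, a.k.a. sensitivity / TPR
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 12
    return error, accuracy, recall, f1

# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN.
er, ac, rec, f1 = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 0.85, error is 0.15, and recall is 40/45; note that error is simply 1 − accuracy.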
The second performance criterion measures the extent to which the RF model can distinguish between the classes, i.e., the ability of the RF model to rank the events with “y = 1” relative to those with “y = 0”. This can be evaluated using the receiver operating characteristic (ROC) curve [45]. The closer the curve is to the upper left-hand corner of the ROC space, the better the classification. The area under the ROC curve (denoted AUC) quantifies the performance of the RF model [46]. In other words, the AUC is a combined measure of sensitivity (true positive rate) and specificity (true negative rate) at various probability threshold settings. Since both axes range between 0 and 1, the AUC can take any value between 0 and 1, and the closer it is to 1, the better the overall diagnostic performance of the test.
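The ROC curve and AUC of an RF classifier can be computed from the predicted class-1 probabilities, as in the illustrative scikit-learn sketch below (synthetic data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Probability of class "y = 1" for each test example.
scores = rf.predict_proba(X_te)[:, 1]

# ROC curve: one (FPR, TPR) point per probability threshold.
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)   # area under the ROC curve
```

Sweeping the threshold traces the curve from (0, 0) to (1, 1); an AUC near 1 indicates that positives are almost always ranked above negatives.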
Random forest with CoClust
In this subsection, the proposed method combining RF and CoClust is introduced. The main purpose of this study is to achieve efficient results while reducing CPU time by adding a feature selection step to the RF technique, thereby further developing RF, which is already a powerful and effective method. In addition, obtaining high-efficiency models in a short CPU time and using these efficient results to predict mortality and spam messages is another gain of the study.
CoClust works through copula families using the nonlinear dependency structure and gives effective results by clustering variables that show nonlinear dependency. By using CoClust in the feature selection step, nonlinear dependency is taken into account, which is one of the important advantages of the study. A new method such as CoClust thus offers an innovative perspective on the traditional approach.
In addition to other feature selection techniques, the reasons for choosing CoClust are listed below.

• It does not require a starting classification to be chosen;

• It does not require the number of clusters to be set a priori;

• It is able to capture multivariate and nonlinear dependence relationships underlying the observed data;

• It does not require the marginal probability distributions to be set as Gaussian;

• It is able to discard irrelevant observations [32].
CoClust makes a difference because it produces results that are fully compatible with the data structure, without interfering with the data or the results. When all the features listed above are considered, it is clear that the researcher cannot directly intervene in the process. This is very important for the reliability and objectivity of the method.
The additional gains of this study are to bring a new approach to RF with the proposed method, to predict mortality by working with a large dataset and to correctly classify spam messages with fewer variables. Using the right variables together is very important for predicting mortality in ICU patients. By observing, with the proposed method, the effect of the variables selected from the MIMIC-III dataset on mortality prediction, an important improvement can be made for ICU patients.
Based on this, first, the RF technique is applied to the MIMIC-III and SMS Spam Collection datasets without adding any steps. At this stage, forests of different sizes are created, and CPU time, accuracy and ROC curve results are recorded. In this step, all results obtained from RF are presented, and the efficiency of the method is emphasized again.
Then, feature selection is carried out with CoClust in the MIMIC-III and SMS Spam Collection datasets, and models are created with the RF method using the variables that passed the selection stage. In the CoClust step, the Gaussian copula and all Archimedean copula families are tried, and only the families that yield a clustering are used. As an advantage of the method, the choice of the number of clusters is left to CoClust without restriction. After clustering with CoClust, RF is applied to the variables in the clusters. The study is repeated with forests of the same sizes for each cluster, and CPU time, accuracy and ROC curve values are recorded for the forests belonging to the clusters. At the end of modeling, the most efficient models are selected and evaluated according to the recorded results.
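The cluster-wise step above can be sketched in Python on synthetic stand-in data. CoClust itself is an R package, so its output is mimicked here by a hypothetical dict mapping each cluster to the column indices it selected; everything after that point (one RF per cluster, accuracy per cluster) follows the study design.

```python
# Sketch of the proposed pipeline on synthetic data; "coclust_clusters" is a
# hypothetical stand-in for the output of the R CoClust package.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))               # stand-in feature matrix
y = (X[:, 0] + X[:, 3] > 0).astype(int)      # stand-in binary outcome

# Hypothetical CoClust output: only mutually dependent variables are kept
coclust_clusters = {"cluster_1": [0, 3], "cluster_2": [5, 7, 9]}

results = {}
for name, cols in coclust_clusters.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, cols], y, test_size=0.3, random_state=0)
    # The paper sweeps 100-10000 trees; 200 keeps this sketch fast
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, rf.predict(X_te))
print(results)  # the cluster holding the informative variables scores higher
```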
The application of the CoClust and RF techniques, whose theoretical background has been explained, to the MIMIC-III and SMS Spam Collection datasets is described in the “Datasets” section.
Datasets
The MIMIC-III and SMS Spam Collection datasets are used to determine the efficiency of the proposed method. In the following sections, the datasets are introduced.
MIMIC-III dataset
The MIMIC database used in the study is a large database, accessible free of charge, covering more than forty thousand patients who were treated in the intensive care units of Beth Israel Deaconess Medical Center between 2001 and 2012. This database includes demographic information, laboratory test results, procedures applied, medications, caregiver notes, imaging reports, hourly records of vital signs and death variables [47].
MIMIC-III, the latest version of MIMIC, includes the hospital records of 46520 patients, 38645 of whom are adults and 7875 of whom are newborns. The most recent data cover the period between June 2001 and October 2012. Although the database has been de-identified, it contains detailed information about the clinical care of patients. Access to the MIMIC-III database is restricted. The academic paper about the database is available here: https://physionet.org/content/mimiciii/1.4/. To obtain the data, proceed from this link: https://physionet.org/settings/credentialing/. In order to use the restricted-access clinical databases hosted on PhysioNet, users must have a credentialed PhysioNet account. Users who do not have a credentialed account must apply for access from this link: https://physionet.org/credentialapplication/. To become a credentialed PhysioNet user and access restricted-access clinical databases such as MIMIC-III, a suitable training program in human research subject protections and HIPAA regulations must be completed. After these steps, the registration of personal information is completed. In our team, these processes were carried out by Prof. Dr. Kasırga Yıldırak.
From the database, forty physiological and demographic variables were obtained. These variables are used in the severity scores (SOFA, SAPS II, APACHE) applied to intensive care patients. Vital signs such as blood pressure, temperature and respiration have been shown to have a strong relationship with mortality [48]. In the literature, a variable pool is created by adding variables such as albumin, hemoglobin and glucose, which are associated with mortality [22, 48].
A death variable was used for the mortality model. Here, the "no death" category is the reference category and is coded as 0.
The variables that are used are given in Table 1.
Respiration, coagulation, liver, renal, central nervous system and cardiovascular function are categorical variables. Vincent et al. [49] describe the categorization criteria for these variables, as shown in Table 2.
Following Johnson et al. [22], only adult patients were studied; 25800 patients remained after removing data caused by registration errors and patients who stayed in intensive care for less than 4 h.
Approximately 60% of these patients were women (15536 people), while 40% were men (10264 people). These results can be seen in Fig. 1.
Frequencies and percentages for categorical variables are shown in Table 3.
Descriptive statistics of the age variable, vital variables, laboratory results and blood values of the patients are shown in Table 4.
The death rate was 31.3% (8075 people), and the non-death rate was 68.7% (17725 people). Predictions are made from the recorded patient data.
SMS spam collection dataset
The SMS Spam Collection is a public set of labeled SMS messages created by Tiago A. Almeida and José María Gómez Hidalgo. The collection incorporates 425 spam SMS extracted from the Grumbletext website, 3375 randomly chosen ham messages from the NUS SMS Corpus (NSC), 450 ham messages collected from Caroline Tag's PhD thesis and 1324 messages from SMS Spam Corpus v.0.1 Big. The dataset consists of 5574 real, non-encoded English ham and spam messages [50,51,52].
The dataset used contains the same information as the original dataset plus additional DistilBERT classification embeddings. It contains 5574 rows and 770 columns. The spam column describes whether a message is spam or not, and the original message column contains the unprocessed messages. The other 768 columns contain the DistilBERT classification embeddings for each message after processing. The dataset and detailed information about it can be found on the UCI Repository website (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The SMS Spam database is open and can be accessed here: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/.
The variables in the dataset are named v1, v2, etc. All variables except the spam variable are continuous. Spam messages make up 13.42% of the data (748 messages), and non-spam messages make up 86.58% (4826 messages).
The problem of SMS spam is evaluated in its legal, economic and technical aspects, as with emails. Unlike emails, text messages usually consist of a few words and are filtered by bag-of-words-based spam filters. Evaluations of feature-based and compression-model-based spam filters have determined that compression-model filters perform well and that bag-of-words-based filters are open to improvement. It has also been found that content filtering for short messages is surprisingly effective [53].
The success of Bayesian filtering techniques, which are very effective for emails, has also been examined for English and Spanish text messages, where a number of message representation techniques and machine learning algorithms were tested in terms of effectiveness. The results showed that Bayesian filtering techniques can be used effectively in SMS spam detection [54].
Application
In this section, we present the results of CoClust, K-means and hierarchical clustering used to build models that evaluate the death rates of ICU patients and the variables that directly affect spam message classification. We compare the results by classifying the obtained clusters with the Random Forest, Gradient Boosting (GB) and Logistic Regression (LR) methods.
The R implementations of CoClust and RF are used in the application. The randomForest, rpart, prediction, caret, cluster, copula, CoClust and copBasic packages are used while applying CoClust and RF. All computational runs were performed on a device with an Intel Core i7-6700HQ CPU @ 2.60 GHz.
Study design
First, the RF, GB and LR techniques are applied to the MIMIC-III and SMS Spam Collection datasets without adding any steps. At this stage, forests consisting of 100, 200, 500, 1000, 2000, 5000 and 10000 trees are created, and CPU time, accuracy and ROC curve results are recorded for the RF and GB methods. Then, the datasets are divided into clusters with the CoClust, K-means and hierarchical clustering techniques, and the clustering results are evaluated in RF, GB and LR applications. The dependency structure between variables is analyzed with copulas so that the variables used in prediction are related to each other. RF with 100, 200, 500, 1000, 2000, 5000 and 10000 trees is then applied to the clusters obtained. The CPU time, accuracy and ROC curve results obtained by applying RF to these clusters separately are compared with the results obtained from the RF application alone. In light of the results obtained, a model proposal for mortality prediction is investigated. In addition, the study aims to quickly and correctly classify spam messages with fewer variables.
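The experiment loop above can be sketched as follows (the study uses R; this is a hedged scikit-learn equivalent on synthetic data): for each forest size, record CPU time, test accuracy, OOB error rate and AUC. Tree counts are truncated here to keep the sketch fast; the study goes up to 10000.

```python
# Synthetic-data sketch of the per-forest-size measurement loop.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

records = {}
for n_trees in (100, 200, 500):
    start = time.process_time()              # CPU time rather than wall time
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=1).fit(X_tr, y_tr)
    cpu_s = time.process_time() - start
    records[n_trees] = {
        "cpu_s": cpu_s,
        "accuracy": accuracy_score(y_te, rf.predict(X_te)),
        "oob_error": 1.0 - rf.oob_score_,
        "auc": roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]),
    }
for n_trees, rec in records.items():
    print(n_trees, rec)
```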
Application of the proposed approach
In this section, first, the RF, GB and LR applications are carried out without any clustering. The CPU time, accuracy and ROC curve results obtained from these runs are recorded. In the next step, the datasets are clustered with the CoClust, K-means and hierarchical clustering techniques. In the clusters obtained, RF and GB are applied with forests of 100, 200, 500, 1000, 2000, 5000 and 10000 trees. The CPU time, accuracy and ROC curve results of the forests obtained for the clusters are also recorded and compared with the previous results. The efficiency of the proposed method is assessed on the basis of this comparison. The model that gives the most efficient result in the shortest CPU time is selected and suggested for mortality prediction.
The use of RF combined with CoClust is referred to as the proposed RF, while the application without CoClust is called traditional RF within the scope of the study.
Traditional RF and other classification methods without applying CoClust
In accordance with the purpose of the study, the RF, GB and LR applications are first performed without adding a feature selection step. The results obtained by creating forests consisting of 100, 200, 500, 1000, 2000, 5000 and 10000 trees are examined.
The CPU time, accuracy, OOB error rate and ROC curve results of the RF application for both datasets are given in Table 5. When the results are examined, the application with 1000 trees gives the most efficient result for both datasets according to the CPU time, ROC, accuracy and OOB error rate results.
The CPU time, accuracy, OOB error rate and ROC curve results of the GB application for both datasets are given in Table 6. Here too, the application with 1000 trees gives the most efficient result for both datasets according to the CPU time, ROC, accuracy and OOB error rate results.
When the Logistic Regression results are examined, the model obtained for the MIMIC-III dataset is found to be significant (p < α = 0.05), and its fit is found to be adequate (p > α = 0.05). The Nagelkerke R2 value is determined to be 0.679. The CPU time, accuracy and ROC curve results of the LR application for MIMIC-III are given in Table 7.
When the LR results for the SMS Spam dataset are examined, the model is found to be neither significant (p > α = 0.05) nor adequately fitted (p < α = 0.05).
Feature selection by CoClust
After preparing the datasets, the dependency structures between variables are examined with CoClust. In the clustering step, the Gaussian copula and all Archimedean copula families are used. The Gumbel, Clayton and Frank families give results for the SMS Spam Collection, while the Clayton and Frank copula families give clustering results for MIMIC-III. The variables to be used in the RF technique are selected with CoClust.
The seven clusters obtained with the Clayton and Frank copulas for MIMIC-III are presented in Table 8, in a layout similar to that of classical clustering methods.
Similar clusters are observed for both the Clayton copula and the Frank copula for MIMIC-III; five of the clusters are found to be the same. No common variables are observed in the clusters determined to be different.
For the SMS Spam dataset, three clusters obtained with the Frank and Clayton copulas and five clusters obtained with the Gumbel copula are presented in Table 9.
The same clusters are obtained by Clayton and Frank copulas for SMS Spam Collection. Three of the five clusters obtained with the Gumbel copula family are identical to the clusters obtained from the Clayton and Frank copulas.
Feature selection by other methods
After the CoClust application, the relationships between variables are examined with the K-means and hierarchical clustering methods. For the MIMIC-III dataset, the four clusters obtained with K-means clustering are presented in Table 10.
For the MIMIC-III dataset, the five clusters obtained with the hierarchical clustering technique are presented in Table 11.
For the SMS Spam dataset, the four clusters obtained with K-means are presented in Table 12.
For the SMS Spam dataset, four clusters obtained with hierarchical clustering technique are presented in Table 13.
Similar clusters are observed for both K-means and hierarchical clustering for the SMS Spam dataset. In both techniques, there are about 700 variables in Cluster 1.
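These two baselines cluster the *variables* rather than the observations: each column of the (standardized) data matrix is treated as one point to be grouped, so every variable is forced into some cluster. A minimal sketch on synthetic data:

```python
# Synthetic-data sketch of variable clustering with K-means and Ward linkage.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))            # 300 observations, 8 variables

V = StandardScaler().fit_transform(X).T  # one row per variable, shape (8, 300)

# K-means on the variables
km_labels = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(V)

# Agglomerative (Ward) clustering on the same representation
hc_labels = fcluster(linkage(V, method="ward"), t=4, criterion="maxclust")

print("K-means variable clusters:     ", km_labels)
print("Hierarchical variable clusters:", hc_labels)
```

Note that, unlike CoClust, both methods assign every variable to a cluster, including variables unrelated to the rest.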
Results of the proposed approach with CoClust and RF, and of the other clustering methods
In this section, the feature selection step is added to the RF technique by considering the dependency between variables in the datasets. The Gaussian copula and all Archimedean copula families are tried, but a clustering result cannot be obtained from every copula family in the CoClust application. Although clustering is achieved with the Clayton, Gumbel and Frank copula families for SMS Spam, it is only possible with the Clayton and Frank copula families for the MIMIC-III dataset. Therefore, the analysis continues using only the clusters from these families.
First, RF is applied to the clusters obtained from the MIMIC-III dataset, starting with the clusters obtained from the Clayton copula. The third and seventh clusters from the Frank copula differ from the Clayton copula clusters, so RF performance is then measured in these clusters as well. After examining the RF results for the clusters obtained with CoClust, the RF results for the clusters obtained with the other clustering techniques are examined.
After the MIMIC-III application, RF is applied to the clusters obtained from the SMS Spam Collection. Since the clusters obtained from the Gumbel copula also include the clusters obtained from the other copula families, the results are observed by applying RF to the clusters from the Gumbel copula family. After examining the RF results for the clusters obtained with CoClust, the RF results for the clusters obtained with the other clustering techniques are examined.
The CPU time results of RF with CoClust clusters are shown in Table 14.
In the previous section, it was determined that the 1000-tree forests give the most efficient results in the RF applications without the feature selection step, with the applications completed in 4.14 and 7.77 min. At this stage, the 1000-tree forests are built in 30.45 and 5.16 s with CoClust clustering.
In contrast, 1.27 and 1.36 min are recorded with the K-means and hierarchical clustering techniques for MIMIC-III, and for the SMS Spam Collection dataset almost no progress is made, with 7.46 and 7.58 min. Here, the benefit of CoClust clustering only the related variables is clearly seen: unlike CoClust, both K-means and hierarchical clustering assign all variables to clusters. These are remarkable improvements.
In the MIMIC-III dataset, the two families produce similar clusters, except for the second and third clusters from the Clayton copula and the third and seventh clusters from the Frank copula. The accuracy, error rate and ROC curve values of the 1000-tree forests give the same results for the similar clusters found with both the Clayton copula and the Frank copula. The accuracy, error and AUROC results of the clusters belonging to the MIMIC-III dataset are given in Table 15.
When Table 15 is examined, selecting for the highest accuracy and the lowest error rate, Cluster 3 from the Clayton copula and Cluster 7 from the Frank copula give the most efficient results. Efficient results can be obtained in the prediction of mortality by using the variables in these clusters. The ROC curve results of RF are also an important step for validation.
The accuracy, error and AUROC results of the K-means clusters belonging to the MIMIC-III dataset are given in Table 16. When the results are examined, Clusters 2 and 3 give the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to the MIMIC-III dataset are given in Table 17. When the results are examined, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
In the SMS Spam Collection, the first three of the clusters obtained from the Gumbel copula family are exactly the same as the clusters obtained from the Frank and Clayton copula families. The accuracy, error rate and ROC curve values of the 1000-tree forests give the same results for these identical clusters. The accuracy, error and AUROC results of the clusters belonging to the SMS Spam Collection dataset are given in Table 18.
When Table 18 is examined, selecting for the highest accuracy and the lowest error rate, Cluster 3 from the Gumbel copula (also obtained with the Frank and Clayton copulas) and Cluster 5 from the Gumbel copula give the most efficient results. Efficient results can be obtained in the prediction of spam messages by using the variables in these clusters. The ROC curve results of RF are also an important step for validation.
The accuracy, error and AUROC results of the K-means clusters belonging to the SMS Spam Collection are given in Table 19. When the table is examined, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to the SMS Spam Collection are given in Table 20. When the results are examined, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
As a result, while a significant efficiency gain in CPU time is achieved with the clusters obtained from CoClust, the same cannot be said for the K-means and hierarchical clustering results. All models from the CoClust clusters work quite well; these results obtained by using CoClust and RF together are remarkable.
When the clustering techniques are examined in terms of accuracy, the most efficient result is obtained from clustering with CoClust. As a result of clustering with CoClust, accuracy and ROC values increase for both datasets, while OOB error rates decrease. In contrast, no improvement, positive or negative, is observed in accuracy or the other evaluation criteria for the MIMIC-III dataset when RF is applied with the K-means and hierarchical clustering results, and the decrease in accuracy and ROC values for the SMS Spam Collection is notable. When both the CPU time improvement and the model selection criteria are examined, it is seen that only clustering with CoClust yields efficient results.
At the stage of choosing the best model for the MIMIC-III dataset, one model each from the Frank copula and the Clayton copula is chosen. When the selected clusters are examined, the third cluster contains the "Cardiovascular, Heart Rate, Glasgow Coma Scale, Central Nervous System" variables and the seventh cluster contains the "Immature Neutrophil Cells, Dopamine, Dobutamine, Norepinephrine" variables.
The SMS Spam Collection dataset is a fairly large dataset with 770 variables. Although a large number of variables is very useful in classification, the application shows that effective and efficient classification results can be obtained using only three variables. Successful classification models can be obtained by using the v182, v316 and v207 variables in the third cluster, or the v472, v620 and v8 variables in the fifth cluster.
Gradient boosting and logistic regression results for clustering methods
In this section, first, the results obtained by applying the CoClust, K-means and hierarchical clustering techniques with Gradient Boosting are examined.
The CPU time results of GB with the CoClust clusters are shown in Table 21. When Tables 6 and 21 are evaluated together, the most efficient result is obtained with CoClust, even though the CPU time decreases for all clustering techniques compared to the GB application without clustering.
The accuracy, error and AUROC results of the CoClust clusters belonging to the MIMIC-III dataset are given in Table 22.
When Table 22 is examined, selecting for the highest accuracy and the lowest error rate, Cluster 1 from both the Clayton copula and the Frank copula gives the most efficient results. Efficient results can be obtained in the prediction of mortality by using the variables in these clusters.
The accuracy, error and AUROC results of the K-means clusters belonging to the MIMIC-III dataset are given in Table 23. When Table 23 is examined, Cluster 2 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to the MIMIC-III dataset are given in Table 24. When the results are examined, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the CoClust clusters belonging to the SMS Spam Collection are given in Table 25.
The accuracy, error and AUROC results of the K-means clusters belonging to the SMS Spam Collection are given in Table 26. According to the results, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to SMS Spam Collection are given in Table 27. When Table 27 is examined, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
Secondly, the Logistic Regression results are examined. The accuracy, error and AUROC results of the CoClust clusters belonging to the MIMIC-III dataset are given in Table 28.
The accuracy, error and AUROC results of the K-means clusters belonging to the MIMIC-III dataset are given in Table 29. When the results are examined, Cluster 2 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to the MIMIC-III dataset are given in Table 30. According to the results, Cluster 1 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the CoClust clusters belonging to the SMS Spam Collection are given in Table 31.
The accuracy, error and AUROC results of the K-means clusters belonging to the SMS Spam Collection are given in Table 32. The model belonging to the first cluster obtained by K-means clustering is found to be neither significant nor adequately fitted. Cluster 2 gives the most efficient results when the highest accuracy and lowest error rate are selected.
The accuracy, error and AUROC results of the hierarchical clusters belonging to the SMS Spam Collection are given in Table 33. The model belonging to the first cluster obtained by hierarchical clustering is likewise found to be neither significant nor adequately fitted. When the results are examined, the remaining clusters give the most efficient results when the highest accuracy and lowest error rate are selected.
The RF results of the obtained models and their comparison with each other are examined in the following section.
Comparison of the results of the proposed approach with the results of other methods
To show the applicability of the new approach, the results obtained from the application of the proposed approach are compared with the results of standard Random Forest, Gradient Boosting and Logistic Regression. The results are compared primarily in terms of accuracy, OOB error rate and AUROC values, and the progress achieved is examined. A further comparison is made in terms of CPU time.
Firstly, the RF results with CoClust are examined. The improvement achieved in terms of accuracy, error rate and ROC curve is quite remarkable. In accuracy, an increase of up to 0.12 is observed in the MIMIC-III dataset, while an increase of 0.06 is observed in the SMS Spam Collection dataset.
The CPU time results, and the improvement observed, for RF with CoClust are given in Table 34. While the efficiency gained in CPU time is approximately 85% and 97% in the application with 100 trees, it reaches 90% and 99% in the application with 10000 trees.
When the CPU time improvement for both datasets is examined in the results of Random Forest with CoClust, it is 87.79% for the first dataset and 98.63% for the second dataset across all forests. The closest result to these averages is obtained in the 1000-tree applications for both datasets. This result will contribute significantly to the time constraint problem, especially in big data. In machine learning analyses, speed decreases as the number of parameters increases; since this increases the time required, researchers tend to reduce the number of trees. However, here we see that the desired result can be achieved by increasing the accuracy without reducing the number of trees. Moreover, it is possible to relax the time constraint: instead of working with 100 trees in traditional RF, one can work with 10000 trees in the same amount of time using the proposed method.
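The percentage figures above follow from the simple reduction formula below; as a check, plugging in the 1000-tree CPU times reported earlier (4.14 min vs 30.45 s, and 7.77 min vs 5.16 s) yields improvements close to the quoted averages.

```python
# CPU-time reduction relative to the traditional (no-CoClust) run
def improvement_pct(traditional_s, proposed_s):
    return 100.0 * (1.0 - proposed_s / traditional_s)

# 1000-tree CPU times reported in the text, converted to seconds
mimic = improvement_pct(4.14 * 60, 30.45)  # MIMIC-III
sms = improvement_pct(7.77 * 60, 5.16)     # SMS Spam Collection
print(f"MIMIC-III: {mimic:.1f}%  SMS Spam: {sms:.1f}%")  # ≈ 87.7% and 98.9%
```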
The change in CPU time of the CoClust and RF application in the MIMIC-III dataset is given in Fig. 2 below.
The change in CPU time of CoClust and RF application in SMS Spam Collection is given in Fig. 3 below.
Adding the feature selection step considerably reduces the duration of the application. Achieving results in a shorter CPU time with accuracy and error results in the application is just as important. Especially as the number of trees increases, the decrease in CPU time becomes even more striking.
In addition to obtaining a result in a short CPU time, the improvement in accuracy and error results is also very important for the analysis. As a result of RF applied to the clusters, accuracy results of up to 0.9436 and 0.9840 are obtained. These striking results can be seen in Tables 15 and 18. The ROC curve values of 0.992 and 0.987 are especially important for mortality prediction and spam message classification.
When the GB results are compared according to the clustering techniques, the highest accuracy and ROC values for both datasets are obtained from the CoClust results. While the accuracy reaches 0.81 for MIMIC-III, it reaches 0.89 for the SMS Spam Collection, which has a large number of variables. On the other hand, the RF results are found to be more efficient than the GB results. Accordingly, the efficiency of the proposed method is remarkable.
However, it is clear that this fruitful result does not come from simply adding a variable selection step: when the K-means and hierarchical clustering techniques are examined, the efficiency of the results obtained from CoClust stands out.
When compared according to the LR results, although K-means clustering for MIMIC-III gives more efficient results than CoClust in terms of accuracy and ROC, the same cannot be said for the SMS Spam Collection, where it could not give a significant and adequate model in Cluster 1. It is not enough for a technique to give positive results in only one setting.
When we examine the LR results for the CoClust clusters, a decrease is observed in the accuracy and ROC values for MIMIC-III. On the other hand, an increase is observed in the criteria of interest for the SMS Spam Collection, which includes a large number of variables. All the models obtained are found to be significant and adequate.
Discussion
The main goal of the study is to increase the prediction power while reducing the application CPU time by adding a novel feature selection step to RF. As seen in the results obtained, the study has reached its aim. CoClust turns out to be a highly effective method.
When the results of the RF application without CoClust are examined for both datasets, the most efficient result is obtained from a forest of 1000 trees according to the accuracy, error rate and ROC curve values. In the 1000-tree applications, CPU times of 4.14 and 7.77 min were recorded. However, when the feature selection step is added with CoClust, the time required for 1000 trees decreases to 30.45 s for the first dataset and to 5.16 s for the other. This result is important and remarkable.
In the MIMIC-III dataset, an 85% reduction in CPU time is observed for the application with 100 trees, and the reduction reaches approximately 90% when the number of trees reaches 10000. In the SMS Spam Collection dataset, this decrease reaches up to 99%, making a large difference. Since the modeling is carried out on fewer variables, the analysis CPU time is considerably shortened. This is a very important development.
The accuracy, OOB error rate and AUROC results have also been carefully studied, since reaching a solution in a short CPU time is not enough on its own. There is also a visible improvement in accuracy, OOB error rate and AUROC, which are the most important outcome measures. A model proposal for mortality prediction can also be made here. On the other hand, accurately classifying spam messages with three variables is a very important development.
In the MIMIC-III dataset, CoClust selects 4 variables out of 40 and forms clusters. In the study performed on a dataset of 25800 people, the average processing CPU time is 4.01 s for 100 trees and 30.45 s for 1000 trees. The lowest AUROC value is 0.883. In the SMS Spam Collection dataset, the classification is carried out successfully after reducing the number of variables from 770 to 3. In this dataset, an application with 100 trees needs 1.36 s, while an application with 10000 trees takes approximately 40 s to complete. While completing the classification in such a short time, the ROC curve value reaches 99%. The lowest ROC curve value obtained in our study is quite good compared to the ROC value obtained in the study of Zhu et al. [27].
In this study, the CPU time improvement is between 85.51% and 99.19% across all forests. For the application with 10000 trees, this efficiency reaches 90% in MIMIC-III and 99% in the other dataset. This is a very serious development in today's big data age because, in machine learning analyses, speed decreases as the number of parameters increases. Since this increases the time required, researchers tend to reduce the number of trees. Here, however, we see that the number of trees can be increased without compromising accuracy for fear of time constraints. While bringing a new perspective to traditional RF, the proposed method gives researchers the opportunity to reach higher accuracy in the same CPU time.
The fact that CoClust works efficiently with nonlinear dependencies and in the field of health has also contributed greatly to the RF step. Thus, a model proposal could be made for mortality prediction. A mortality prediction carried out with the variables in the third cluster of the Clayton copula and the seventh cluster of the Frank copula yields efficient results.
When the cluster results of K-means, hierarchical clustering and CoClust are compared, the most efficient results in terms of both accuracy and CPU time are obtained from the CoClust clusters. CoClust creates balanced clusters because it does not include uncorrelated variables in the clustering; this is one of its important differences from the other methods. The clusters obtained by these clustering techniques are also examined with the GB and LR classification methods as well as RF. Again, the most efficient results were obtained in the classification made with RF.
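The structure of that comparison can be sketched as follows. This is an illustrative approximation on synthetic data: K-means applied to the transposed, standardized matrix groups the variables (a stand-in for the copula-based grouping), and each classifier is scored on its best-performing variable group. All sizes and cluster counts are assumptions.

```python
# Group variables by K-means on the transposed data, then compare RF/GB/LR.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)
Xs = StandardScaler().fit_transform(X)

# Cluster the 40 variables (rows of Xs.T) into 4 groups.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Xs.T)

scores = {}
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("GB", GradientBoostingClassifier(random_state=0)),
                  ("LR", LogisticRegression(max_iter=1000))]:
    # Cross-validate each variable group; keep the best for this classifier.
    best = max(cross_val_score(clf, X[:, labels == k], y, cv=5).mean()
               for k in range(4))
    scores[name] = best
print(scores)
```

Substituting hierarchical clustering (e.g. `scipy.cluster.hierarchy` on a correlation-based distance) for K-means in the grouping step reproduces the other arm of the comparison.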
The results obtained from the clustering stage with CoClust, one of the key steps of the study, are carefully examined. This step matters both for the development of CoClust and for the subsequent modeling phase with RF. The most efficient results are obtained in the third cluster from the Clayton copula and the seventh cluster from the Frank copula.
The Clayton copula family is an asymmetrical Archimedean copula that captures dependency in the lower (left) tail, while the Frank copula family is a symmetrical Archimedean copula. This shows that the CoClust technique differs from other techniques by offering solutions with both asymmetrical and symmetrical approaches. For this reason, CoClust is thought to work more efficiently when investigating dependency in the tails [55].
When the validity of the models built from the variables in the clusters is examined, high validity results are obtained for all of them. The importance of including correlated variables in the modeling is remarkable. The results of the models are satisfactory, which is important given the need for high accuracy in mortality risk studies.
According to accuracy, the OOB estimate of error rate and the ROC curves, two clusters are selected from the Clayton and Frank copula families. The cardiovascular, heart rate, Glasgow Coma Scale, and central nervous system variables are included in Cluster 3 from the Clayton copula. The immature neutrophil cells, dopamine, dobutamine, and norepinephrine variables are included in Cluster 7 from the Frank copula.
It is noteworthy that the variables in the selected clusters are also emphasized in the literature. The relationship between heart rate variability, patient coma status and the Glasgow Coma Scale value has been demonstrated: a notable reduction in heart rate is found in patients depending on their Glasgow Coma Scale score [56]. The purpose of the study by Cooke et al. [57] was to assess heart rate variability and its association with mortality in prehospital trauma patients. They also used Glasgow Coma Scale values and examined the relationship of these variables with mortality in trauma patients.
Wan-Ting et al. [58], Hekmat et al. [59] and Hasanin et al. [60] examine the relationship between the heart rate, cardiovascular and Glasgow Coma Scale variables and mortality risk in adult severe trauma and cardiac patients.
Baser et al. [61] investigated the relationship between neutrophil cells and mortality risk prediction and emphasized that vasopressors are used in patients who survive.
On the other hand, with the rapid growth of big data in the field of technology, data management becomes more difficult. In this context, classifying more accurately with fewer variables is very important. Reaching the spam classification in a very short time with only 3 of the 770 variables in the SMS Spam Collection dataset is therefore very valuable.
The results obtained in areas such as technology and health, where it is very important to make the right decision quickly, are striking. Just as fast and accurate estimation of mortality is important in healthcare, fast and accurate classification of spam messages is very important in the age of technology. The high validity results obtained from each of the models clearly show the importance of using correlated variables in modeling.
Conclusion
According to the results obtained, the use of RF and CoClust together improves CPU time and prediction. In addition, CoClust produces groups of uncorrelated variables, making interpretation easier for practitioners, especially for medical data with highly correlated factors.
It has been shown that the proposed methodology works well for different types of big data. This suggests that similar significant improvements should be possible in other artificial intelligence applications. For researchers, it has become possible to eliminate the constraint of speed decreasing as learning increases.
CoClust's ability to select variables should continue to be rigorously examined in future studies. Examining different data types and their behavior with machine learning techniques, and making these methods applicable in practice, will both facilitate researchers' work in the age of big data and lead to other variable selection methods.
Availability of data and materials
Since MIMIC-III access is granted individually by MIT, the dataset will not be shared; it is available at: https://physionet.org/content/mimiciii/1.4/. The SMS Spam dataset is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/.
References
Darwiche AA. Machine learning methods for septic shock prediction. PhD Thesis, Nova Southeastern University; 2018. Retrieved from NSUWorks, College of Engineering and Computing (1051). https://nsuworks.nova.edu/gscis_etd/1051
Lee J. Patient-specific predictive modeling using random forests: an observational study for the critically ill. JMIR Med Inform. 2017. https://doi.org/10.2196/medinform.6690.
Levantesi S, Nigri A. A random forest algorithm to improve the Lee-Carter mortality forecasting: impact on q-forward. Soft Comput. 2020;24(12):8553–67. https://doi.org/10.1007/s0050001904427z.
McWilliams CJ, et al. Towards a decision support tool for intensive care discharge: machine learning algorithm development using electronic healthcare data from MIMIC-III and Bristol, UK. BMJ Open. 2019. https://doi.org/10.1136/bmjopen2018025925.
Mistry P, Neagu D, Trundle PR, Vessey JD. Using random forest and decision tree models for a new vehicle prediction approach in computational toxicology. Soft Comput. 2016;20(8):2967–79. https://doi.org/10.1007/s0050001519259.
Van Poucke S, Kovacevic A, Vukicevic M. Early prediction of patient mortality based on routine laboratory tests and predictive models in critically ill patients. In: Data Mining. InTech; 2018. https://doi.org/10.5772/intechopen.76988.
Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40. https://doi.org/10.1007/BF00058655.
Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn. 2000;40:139–57. https://doi.org/10.1023/A:1007607513941.
Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20(8):832–44. https://doi.org/10.1109/34.709601.
Breiman L. “Using Adaptive Bagging To Debias Regressions.” Technical Report 547. Berkeley: University of California at Berkeley; 1999.
Breiman L. Random forests. Mach Learn. 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.
Shi T, et al. Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005;18(4):547–57. https://doi.org/10.1038/modpathol.3800322.
Shi T, Horvath S. Unsupervised learning with random forest predictors. J Comput Graph Stat. 2006;15(1):118–38.
Hapfelmeier A, Ulm K. Variable selection by random forests using data with missing values. Comput Stat Data Anal. 2014;80:129–39. https://doi.org/10.1016/j.csda.2014.06.017.
Uddin T, Uddin A. A guided random forest based feature selection for activity recognition. In: 2nd International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT); 2015. https://doi.org/10.1109/ICEEICT.2015.7307376.
Gupta C. Feature selection and analysis for standard machine learning of audio beehive samples. MSc Thesis, Utah State University; 2019. https://digitalcommons.usu.edu/etd/7564.
Sklar A. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris. 1959;8:229–31.
Nelsen RB. An introduction to copulas. 2nd ed. Berlin: Springer Science & Business Media; 2006.
Jaworski P, Durante F, Härdle W, Rychlik T. Copula theory and its applications. Proceedings of the workshop held in Warsaw, 25–26. https://doi.org/10.1007/9783642124655.
Mesiar R, Sheikhi A. Nonlinear random forest classification, a copula-based approach. Appl Sci. 2021;11:7140. https://doi.org/10.3390/app11157140.
Di Lascio FML. Analyzing the dependence structure of microarray data: a copula-based approach. PhD Thesis, University of Bologna; 2008.
Johnson AEW, Mark RG. Real-time mortality prediction in the intensive care unit. AMIA Annu Symp Proc. 2018;2017:994–1003.
Di Lascio FML, Giannerini S. A copula-based algorithm for discovering patterns of dependent observations. J Classif. 2012;29(1):50–75. https://doi.org/10.1007/s003570129099y.
Di Lascio FML, Giannerini S. Clustering dependent observations with copula functions. Stat Pap. 2019;60(1):35–51. https://doi.org/10.1007/s0036201608223.
Khan YA, Shan QS, Liu Q, Abbas SZ. A non-parametric copula-based decision tree for two random variables using MIC as a classification index. Soft Comput. 2021;25(15):9677–92. https://doi.org/10.1007/s00500020053991.
Eling M, Toplek D. Modeling and management of nonlinear dependenciescopulas in dynamic financial analysis. J Risk Insur. 2009;76:651–81. https://doi.org/10.1111/j.15396975.2009.01318.x.
Zhu Y, et al. Machine learning prediction models for mechanically ventilated patients: analyses of the MIMIC-III database. Front Med. 2021;8:662340. https://doi.org/10.3389/fmed.2021.662340.
Khope SR, Elias S. Critical correlation of predictors for an efficient risk prediction framework of ICU patient using correlation and transformation of MIMICIII dataset. Data Sci Eng. 2022;7:71–86. https://doi.org/10.1007/s41019022001766.
Frees EW, Valdez EA. Understanding relationships using copulas. North Am Actuar J. 1998;2(3):1–25. https://doi.org/10.1080/10920277.1998.10595667.
Joe H, Xu JJ. The estimation method of inference functions for margins for multivariate models. Vancouver: University of British Columbia; 1996.
Genest C, Ghoudi K, Rivest LP. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika. 1995;82(3):543–52.
Di Lascio FML, Durante F, Pappada R. Copulas and dependence models with applications. Berlin: Springer International Publishing; 2017. p. 49–65.
Di Lascio FML, Disegna M. A copula-based clustering algorithm to analyse EU country diets. Knowledge-Based Systems. 2017;132:72–84. https://doi.org/10.1016/j.knosys.2017.06.004.
Xue J, Yang B, Tang Q. Seabed sediment classification using multibeam backscatter data based on the selecting optimal random forest model. Appl Acoust. 2020;167:107387. https://doi.org/10.1016/j.apacoust.2020.107387.
Rivest RL, Hellman ME, Anderson JC. Responses to NIST’s proposal. Commun ACM. 1992;35(7):41–54. https://doi.org/10.1145/129902.129905.
Gray KR, et al. Random forestbased similarity measures for multimodal classification of Alzheimer’s disease. Neuroimage. 2013;65:167–75. https://doi.org/10.1016/j.neuroimage.2012.09.065.
Qiu Z, Qin C, Jiu M, Wang X. A simple iterative method to optimize protein-ligand-binding residue prediction. J Theor Biol. 2013;317:219–23. https://doi.org/10.1016/j.jtbi.2012.10.028.
Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. 2nd ed. New York: Springer; 2009.
Gupta S, Jamal S, Open Source Drug Discovery Consortium, Scaria V. Cheminformatics models for inhibitors of Schistosoma mansoni thioredoxin glutathione reductase. Sci World J. 2014. https://doi.org/10.1155/2014/957107.
Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249–60. https://doi.org/10.1016/j.csda.2007.08.015.
Li BK, et al. Modeling, predicting and virtual screening of selective inhibitors of MMP3 and MMP9 over MMP1 using random forest classification. Chemom Intell Lab Syst. 2015;147:30–40. https://doi.org/10.1016/j.chemolab.2015.07.014.
Jamal S, Scaria V. Cheminformatic models based on machine learning for pyruvate kinase inhibitors of Leishmania mexicana. BMC Bioinformatics. 2013;14(1):329. https://doi.org/10.1186/1471210514329.
Kovalishyn V, et al. Predictive QSAR modeling of phosphodiesterase 4 inhibitors. J Mol Graph Model. 2012;32:32–8. https://doi.org/10.1016/j.jmgm.2011.10.001.
Chang KY, Yang JR. Analysis and prediction of highly effective antiviral peptides based on random forests. PLoS ONE. 2013;8(8):e70166.
Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8(4):283–98. https://doi.org/10.1016/s00012998(78)800142.
Rohmer J, et al. Casting light on forcing and breaching scenarios that lead to marine inundation: combining numerical simulations with a random forest classification approach. Environ Model Softw. 2018;104:64–80. https://doi.org/10.1016/j.envsoft.2018.03.003.
Johnson AEW, et al. MIMICIII, a freely accessible critical care database. Sci Data. 2016;3:1.
Zhang Q, Xiao M, Singh VP. Uncertainty evaluation of copula analysis of hydrological droughts in the east river Basin, China. Global Planet Change. 2015;129:1–9. https://doi.org/10.1016/j.gloplacha.2015.03.001.
Vincent JL, et al. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive Care Med. 1996;22(7):707–10.
Almeida TA, Hidalgo JMG, Silva TP. Towards SMS spam filtering: results under a new dataset. Int J Informat Secur Sci. 2013;2(1):1–18.
Almeida TA, Hidalgo JMG, Yamakami A. Contributions to the study of SMS spam filtering: new collection and results. In: Proceedings of the 2011 ACM symposium on document engineering. Association for Computing Machinery; 2011. p. 259–62. https://doi.org/10.1145/2034691.2034742.
Hidalgo JMG, Almeida TA, Yamakami A. On the validity of a new SMS spam collection. In: Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA); 2012. p. 240–5. https://doi.org/10.1109/ICMLA.2012.211.
Cormack GV, Hidalgo JMG, Sánz EP. Spam filtering for short messages. Int Conf Informat Knowl Manag Proc. 2007. https://doi.org/10.1145/1321440.1321486.
Hidalgo JMG, Bringas GC, Sánz EP, García FC. Content based SMS spam filtering. In: Proceedings of the 2006 ACM symposium on document engineering (DocEng); 2006. p. 107–14. https://doi.org/10.1145/1166160.1166191.
İlhan Z. Kopula Temelli Değişken Kümeleme Tekniklerinin İncelenmesi ve Mortalite Tahmini Uygulaması [Investigation of copula-based variable clustering techniques and a mortality prediction application]. PhD Thesis, Eskisehir Osmangazi University; 2019.
Machado-Ferrer Y, et al. Heart rate variability for assessing comatose patients with different Glasgow Coma Scale scores. Clin Neurophysiol. 2013;124(3):589–97. https://doi.org/10.1016/j.clinph.2012.09.008.
Cooke WH, et al. Heart rate variability and its association with mortality in prehospital trauma patients. J Trauma Injury Infect Crit Care. 2006;60(2):363–70. https://doi.org/10.1097/01.ta.0000196623.48952.0e.
Wan-Ting C, et al. Reverse shock index multiplied by Glasgow Coma Scale (rSIG) predicts mortality in severe trauma patients with head injury. Sci Rep. 2020;10(1):2095. https://doi.org/10.1038/s4159802059044w.
Hekmat K, et al. Daily assessment of organ dysfunction and survival in intensive care unit cardiac surgical patients. Ann Thorac Surg. 2005;79(5):1555–62. https://doi.org/10.1016/j.athoracsur.2004.10.017.
Hasanin A, et al. Incidence and outcome of cardiac injury in patients with severe head trauma. Scand J Trauma Resusc Emerg Med. 2016;24(1):1–6. https://doi.org/10.1186/s130490160246z.
Baser K, et al. Changes in neutrophil-to-lymphocyte ratios in post-cardiac arrest patients treated with targeted temperature management. Anatol J Cardiol. 2017;18(3):215–22. https://doi.org/10.14744/anatoljcardiol.2017.7716.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
The three authors contributed to the planning, design and writing of the manuscript. The conception of the definitions, the results and the corresponding proofs were regularly discussed by the three authors. The first draft of the manuscript was written by ZI, and all authors read and commented on each version and on future directions to take. ZI is also the corresponding author. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ilhan Taskin, Z., Yildirak, K. & Aladag, C.H. An enhanced random forest approach using CoClust clustering: MIMICIII and SMS spam collection application. J Big Data 10, 38 (2023). https://doi.org/10.1186/s40537023007209
Keywords
 CoClust
 Random forest
 Copula
 Feature selection
 MIMICIII