Adaptive multiple imputations of missing values using the class center

Phiwhorm, Kritbodin; Saikaew, Charnnarong; Leung, Carson K.; Polpinit, Pattarawit; Saikaew, Kanda Runapongsa

doi:10.1186/s40537-022-00608-0

Research
Open access
Published: 28 April 2022

Adaptive multiple imputations of missing values using the class center

Kritbodin Phiwhorm¹,
Charnnarong Saikaew²,
Carson K. Leung³,
Pattarawit Polpinit¹ &
…
Kanda Runapongsa Saikaew¹

Journal of Big Data volume 9, Article number: 52 (2022) Cite this article

3451 Accesses
9 Citations
Metrics details

Abstract

Big data has become a core technology to provide innovative solutions in many fields. However, the collected dataset for data analysis in various domains will contain missing values. Missing value imputation is the primary method for resolving problems involving incomplete datasets. Missing attribute values are replaced with values from a selected set of observed data using statistical or machine learning methods. Although machine learning techniques can generate reasonably accurate imputation results, they typically require longer imputation durations than statistical techniques. This study proposes the adaptive multiple imputations of missing values using the class center (AMICC) approach to produce effective imputation results efficiently. AMICC is based on the class center and defines a threshold from the weighted distances between the center and other observed data for the imputation step. Additionally, the distance can be an adaptive nearest neighborhood or the center to estimate the missing values. The experimental results are based on numerical, categorical, and mixed datasets from the University of California Irvine (UCI) Machine Learning Repository with introduced missing values rate from 10 to 50% in 27 datasets. The proposed AMICC approach outperforms the other missing value imputation methods with higher average accuracy at 81.48% which is higher than those of other methods about 9 – 14%. Furthermore, execution time is different from the Mean/Mode method, about seven seconds; moreover, it requires significantly less time for imputation than some machine learning approaches about 10 – 14 s.

Introduction

Big data has become a critical technology for developing novel solutions in a wide variety of fields. For instance, large and complex amounts of structured and unstructured data are growing at high-speed rates [1]. The development of these big data will enhance the discovery of useful information, such as hidden patterns and unknown correlations, that can be useful in many fields, including healthcare, financial, manufacturing, and social life [2], such as the failure or malfunction of the sensors that provide information or partial observation of an object of interest because of some hidden phenomenon.

Many real-world applications suffer a common drawback, missing or unknown data. For example, some results may be lost in an industrial experiment due to mechanical faults during the data gathering procedure. Likewise, some tests cannot be done in medical diagnosis because some medical tests may not be appropriate for certain patients, or the medical report proforma permits the omission of specific qualities.

The quality of data [3] is a significant concern to them for conducting effective data analytics. Although the outcome of data analysis tasks depends on several factors such as attribute selection, algorithm selection, and sampling techniques, a critical dependency relies upon the efficient handling of missing values [4]. The data is either missing or incorrectly entered by a human, which results in an incorrect prediction [5, 6], as missing values degrade performance. Therefore, missing data is a significant issue in big data analytics, as it can significantly increase the cost of computation and skew the results [7]. As a result, data quality is a fundamental requirement for big data processing, and data quality suffers when missing values are present [8].

A data analysis algorithm cannot handle incomplete datasets directly by itself. The simplest way to deal with this problem is case deletion, which means directly removing all the data of cases with missing values [9, 10]. However, if the missing value rate is high, the deletion approach affects the remainder of the complete data and can reduce the accuracy of the results [11]. As a result, reliable imputation techniques are necessary to consider the matter of missing data. Additionally, imputation of missing data can aid in the maintenance of the completeness of a dataset, which is critical in small-scale data mining projects and big data analytics.

To date, missing value imputation (MVI) has been proposed as a promising solution for incomplete datasets [12,13,14,15,16,17,18,19]. MVI can be broadly classified into statistical and machine learning techniques. The mean and mode are common statistical MVI technique measurements that typically require a short time to compute. However, machine learning MVI techniques, such as support vector machine (SVM), and random forest (RF) methods, require a long computation time to achieve high accuracy [20,21,22,23]. On the other hand, the k-nearest neighbor (KNN) technique [24] requires much less imputation time than other machine learning techniques [25,26,27,28]. However, the KNN method performs only an online search of the nearest neighbors through the Euclidean distance function [29].

Among the KNN-based methods, Troyanskaya et al. [30] and Daberdaku et al. [31] presented a weighted KNN algorithm for the missing data imputation. Further, Cheng et al. [32] proposed a KNN method that used purity to enhance the performance of K nearest neighbors. Fan et al. [33] proposes the weighted KNN approach, which uses the inverse of the Euclidean distance as the weight for each data point. Of all these KNN-based weighted methods, the set of nearest neighbors is computed by the weight distance between the data of missing values and the complete data.

Sometimes, although more complex algorithms might produce better imputation results, they will generally require a higher computational cost, which is a consideration in machine learning techniques versus statistical techniques [34]. However, most machine learning techniques are usually more computationally expensive than many statistical techniques, due to the model training and construction process.

Recently, class center-based missing value imputation was proposed to produce relatively better imputation results at a lower computational cost [35,36,37]. The class center is based on the mean of the data samples in a specific class, which is similar to the idea of the cluster center (or centroid) applied in the k-mean algorithm [38]. Thereafter, the Euclidean distances between each data sample and the class center are measured, to define a threshold for the later imputation guideline in “Materials and methods” section.

This study aims to propose an algorithm for missing value imputation such that it achieves high accuracy, yet requires minimal time. This presents a novel imputation method: the adaptive multiple imputations of missing values using the class center approach (AMICC). The key contributions of our work are (1) the class center is based on the mean/mode of the data samples and replaces values appropriately, according to the attribute type of the dataset, (2) the proposed adaptive threshold value follows the standard deviation (STD) values, where the computation can indicate how data are spread out over a range of normal and filter outliers, and (3) for outlier data, the missing values are replaced with more appropriate values by using the median and the average weight distance values of the class.

The remainder of this article is organized as follows: “Materials and methods” section presents the related work and describes the proposed model, the diversity operator, and the AMICC algorithm design. The experimental results are presented in “Experiments and results” section, while “Discussion” section offers a discussion and the conclusions are presented in “Conclusions” section.

Materials and methods

In this section, we first present the missing value imputation in “Missing value imputation” section. “Our adaptive multiple imputations of missing values using class centers” section details our proposed method, the AMICC algorithm.

Missing value imputation

MVI uses a statistical or machine learning method to estimate the observed data chosen to replace the missing values. The simplest statistical methods for continuous and discrete variables are mean and mode imputation [39], respectively. Besides statistical techniques, MVI also uses machine learning methods to estimate the observed data chosen to replace the missing values. For instance, MVI analyzes a pattern classification task where the missing feature is employed as the target output for the classification model. The rest of the complete features are the input attributes used to train and test the model [40].

One of the most widely used machine learning techniques is KNN imputation [41], where missing values are imputed using the values calculated from the nearest neighbor observed data. In finding the nearest neighbors, the preferred choice in nearest neighbor classification is to define the Euclidean distance, which is defined as:

$$\begin{aligned} \begin{aligned} dist(x_{i},x_{j})=\sqrt{\sum _{n=1}^{N}[x_{ni}-x_{nj}]^{2}} \end{aligned} \end{aligned}$$

(1)

where function $dist(x_{i}, x_{j})$ computes the distance between the instance $x_{i}$ and $x_{j}$, N is the number of attributes or features, and $x_{ni}$ represents the ith instance in the nth attribute.

The baselines compared with the proposed method are usually based on statistical and machine learning techniques. Two other well-known machine learning techniques for MVI are the SVM [42] and RF [43] algorithms. The SVM algorithm uses kernel functions for the nonlinear mapping of an original feature space into a higher dimensional feature space to build a hyperplane. The RF algorithm is an ensemble of decision tree classifiers, which establishes the outcome based on the predictions of the decision trees. RF predicts the outcome by taking the average or mean of the output from various trees.

Our adaptive multiple imputations of missing values using class centers

Several real-world datasets often found with not-a-numbers (NaNs), blank fields, or other placeholders may have missing values. Training a model with a dataset of many missing values can drastically impact the quality of the machine learning or statistical model, resulting in higher computational costs. If the quantum of missing data is large, the efficiency will fluctuate accordingly.

As indicated in “Missing value imputation” section, missing values are commonly replaced by mean/mode. Hence, in the AMICC method, the class center is based on the mean/mode of the data samples in a specific class. In the MVI approach, the AMICC method replaces the missing values with the mean or mode depending on the attribute type. In the outlier data of each class, the AMICC method identifies the threshold values for checking the outlier data; the threshold values are calculated based on the distances between the class centers and their correspondence to the complete data.

In Fig. 1, the AMICC approach comprises three modules: the first focuses on data preprocessing in “Data pre-processing” section, the second calculates the threshold identification in “Threshold identification” section and the third imputes the missing values in “Imputation of missing values” section. These three modules are described in the following sub-sections.

Data pre-processing

In the data pre-processing section, there are some differences in UCI dataset experiments and missing data types based on the missing completely at random (MCAR) which the presence of missing data does not depend on the input values perse [44]. Therefore, in large datasets plagued by MCAR missing data, samples with missing values can be discarded without biasing the distribution of the remaining data.

This study simulated missing rates of 10%, 20%, 30%, 40%, and 50% [9, 15, 20, 32, 35, 41] to compare the proposed method to the imputation methods listed in the UCI datasets experiment. As shown in Eq. (2), the missing rate is a percentage of the total number of missing values in the dataset. All variables except the class attribute had their missing values simulated.

$$\begin{aligned} \begin{aligned} missing\ rate = \frac{number\ of\ missing\ values \times 100}{number\ of\ examples \times number\ of\ features} \end{aligned} \end{aligned}$$

(2)

The missing rate of 50% in this study is the highest when the number of examples and features is considered. For example, consider the Blood dataset, which contains 748 examples and nine features, as illustrated in Table 1. According to equation (2), there were 3,366 missing values when the missing rate was set to 50% (50 = 3366 × 100 / (748 × 9)).

Additionally, normalization is a technique frequently used during the data preparation process. The goal of normalization is to change the values of numeric columns in the dataset to a standard scale, without distorting differences in the ranges of values. The incomplete dataset must be normalized in the domain [0,1], as normalized data on the same scale avoids the effect of different attribute ranges on distance calculation. Thereafter, the incomplete dataset is divided into two subsets: one is the incomplete data containing the missing values for later imputation, and the other is the complete data without missing values for calculating the initial values of the next step.

For example, Fig. 2 shows a three-class incomplete dataset with ten feature dimensions ($F=10$), in which the question marks represent attributes with missing values. Class i (i = 1 to N; N = 3) of D, denoted by $D_{i}$, is divided into $D_{i\_complete}$ and $D_{i\_incomplete}$.

Threshold identification

Figure 3 shows that the process of identifying the threshold based on the distances between the class centers and their correspondences to the complete data described in more detail below.

From the incomplete dataset D containing N classes, dataset D is divided into complete ($D_{\_complete}$) and incomplete ($D_{\_incomplete}$) subsets, where $D_{\_incomplete}$ contains missing values. For the i-th class of $D_{i\_complete}$, the class center (cent($D_{i}$)), mode, and median are calculated. When computing the class center values for a numerical attribute, the mean is used as the class center. Otherwise, if the attribute is categorical, the mode value is the class center value.

Next, the Euclidean distances between cent($D_{i}$) and every data sample in Class i are computed. Figures 4 and 5 show an example of calculating the center of Class 1, cent($D_{1}$), and the distances between cent($D_{1}$) and the other data samples.

Based on the distances, in Fig. 4a, the mean is used for calculating the distances for a numerical dataset; in Fig. 4b, the mode is used for calculating the distances for a categorical dataset; in Fig. 5, the mean or mode is used for calculating distances for a mixed dataset, in which the mean or mode of these distances is used as the threshold ($T_{1}$) for Class 1. Thereafter, this step is repeated until the threshold for each class is obtained. The pseudocode for the threshold identification module is shown in Algorithm 1.

Imputation of missing values

Imputation techniques can be straightforward or quite complicated. These techniques compute the mean/mode of the non-missing values in the complete data and replace the missing values in incomplete data. A single value replaces a missing value for a single imputation, such as the mean of the entire dataset. Multiple imputations are widely accepted as the standard for dealing with missing data in a variety of research fields. Multiple imputations are used to derive unbiased and valid estimates from available data.

In outlier data, the AMICC method checked the normal distribution using the STD value to determine whether the given measurement deviates from the mean. In statistics, STD is a frequently used yardstick of measurement variability. A low STD value indicates that data points are typically very close to the norm, whereas a high STD value indicates that data points span a wide range of values.

Figure 6 shows that the process of the imputation of missing values consists of the following two steps; the first step is to perform a preliminary imputation of the missing value using the mean/mode of each attribute in a class and the second step is to compare the outlier data with STD values. There are two ways to handle outlier data; (1) if STD $<=$ 1, check the outlier data by calculating the distance between the missing value and the class center; if the distance exceeds a threshold, the missing value is considered outlier data and replaced with the median value. Next, (2) if STD > 1, the missing value is considered an outlier. The average weight distance is calculated from the weight distance between the missing value and its nearest neighbors in the complete data. Then, the average weight distance is replaced for the missing value. The proposed method for imputed values is described in detail below.

Step 1: For Class i, the incomplete dataset ($D_{i\_incomplete}$) is composed of a missing data sample (Num). Figures 7, 8, and 9 illustrate examples of a Class 1 incomplete dataset ($D_{1\_incomplete}$) for numerical, categorical, and mixed datasets, respectively, where the data j (j = 1 to Num) contain one missing value in Figs. 7a, 8a, and 9a and multiple missing values in Figs. 7b, 8b, and 9b. In the examples shown in these figures, the missing feature of data j, cent($D_{1}$), and imputed values are in the red text. The distance between cent($D_{1}$) and the imputed data j is calculated and compared with the threshold $T_{1}$ in the next step.

Step 2: This step consists of two cases, (1) if STD $<=$ 1, from the preliminary imputed dataset from Step 1, Fig. 10 illustrates how to impute outlier data for STD values less than one, in which the algorithm compares the outlier data to the threshold value of the class. In Fig. 11, for example, if the distance is less than $T_{1}$, the imputation process for data j is complete; otherwise, each outlier datum is imputed to the median of Class 1.

In the other case (2), if STD > 1, Fig. 12 shows imputed data to the outlier. According to equation (3), the average weight distance [33] is arrived at by calculating the weight distance between the missing value and its nearest neighbors in the complete data.

$$\begin{aligned} W_{i}=average\left[\frac{1}{dist(y_{i},x_{1})}+\frac{1}{dist(y_{i},x_{2})}+...+\frac{1}{dist(y_{i},x_{j})}\right] \end{aligned}$$

(3)

where $W_{i}$ is a weight distance of ith outlier data, $y_{i}$ is the ith instance of outlier data, and $x_{1}$ is the first instance of complete data. From “Missing value imputation” section, function $dist(y_{i}, x_{j})$ computes the distance between the instance $y_{i}$ and $x_{j}$. This step is repeated until the average weight distance for each instance is obtained.

After computing all average weight distances, the outlier datums are imputed to the average weight distance. Algorithm 2 is proposed for missing value imputation.

Experiments and results

This section presents the performance evaluation and comparison of the proposed AMICC method and statistical and machine learning methods.

Experimental setup

The experimental data included 13 numerical, six categorical, and eight mixed datasets collected from the UCI Machine Learning Repository [45]. These datasets have been the subject of several studies on machine learning methods and cover examples of datasets of small-, medium-, and large-size [9, 12, 13, 24, 25, 35, 41]. The characteristics of these datasets are shown in Tables 1, 2, and 3. All the datasets show considerable diversity in the number of examples, features, and classes.

Table 1 Basic information on the numerical datasets

Adaptive multiple imputations of missing values using the class center

Abstract

Introduction

Materials and methods

Missing value imputation

Our adaptive multiple imputations of missing values using class centers

Data pre-processing

Threshold identification

Imputation of missing values

Experiments and results

Experimental setup

Accuracy analysis

RMSE and hit rate analysis

Execution time analysis

Discussion

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interestss

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords