Exploring and cleaning big data with random sample data blocks
Journal of Big Data volume 6, Article number: 45 (2019)
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSPExplore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis on using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSPExplore method. The experimental results on three real data sets show that the approximate results from RSPExplore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
Introduction
Motivation
Sampling-based approaches have been adopted to alleviate the burden of big data volume not only when approximate results are as useful as exact ones [1,2,3,4,5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6,7,8,9]. It is a common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10,11,12,13,14,15,16,17,18,19]. This iterative process becomes impractical or impossible on small computing clusters due to the communication, I/O and memory costs of cluster computing frameworks that implement a shared-nothing architecture [20,21,22]. While these distributed frameworks have not adapted well to the requirements of data exploration tasks, existing sequential techniques don’t scale easily to big data [23]. In fact, there are plenty of data exploration and analysis libraries in common data science languages, e.g., R and Python [24, 25]. To scale these libraries to big data on computing clusters, new distributed implementations are required to process distributed data. Even with distributed algorithms, the memory of the computing cluster may not be enough to hold the entire data. This is because the data growth rate is outpacing the technology scaling [26]. In this paper, our objective is to enable data scientists to iteratively explore and clean big data on small computing clusters by applying existing or user-defined algorithms to a few ready-to-use random sample data blocks.
On computing clusters, big data is divided into small disjoint distributed data blocks in the Hadoop Distributed File System (HDFS) [27]. In fact, HDFS blocks are the basic units of both storage and processing in cluster computing frameworks (e.g., Apache Hadoop^{Footnote 1}, Apache Spark^{Footnote 2}). The MapReduce computing model [28] is adapted to process HDFS blocks in parallel and get the result for the entire data set [29,30,31]. Hadoop-based computing clusters have become the norm for big data management and analysis in different application areas [32,33,34,35,36]. To facilitate random sampling on these clusters, the Random Sample Partition (RSP) distributed data model was proposed recently [37]. In this model, the statistical properties of the data set are preserved in its distributed data blocks by making HDFS blocks ready-to-use random samples (RSP blocks) of the entire data. This can avoid both the cost of record-level sampling (RLS) and the biased results from block-level sampling (BLS) of inconsistent HDFS blocks [38,39,40,41]. An RSP is generated offline, and only once, from an HDFS file using a two-stage data partitioning method [42, 43]. To analyze the data set, a block-level sample of RSP blocks is selected and processed in parallel using a sequential algorithm to get an approximate result for the entire data set. The RSP approach enables approximate big data analysis on small computing clusters, especially when the results from a few RSP blocks are equivalent to those from the entire data set [3, 44].
Contributions
Given the statistical and computational advantages of the RSP approach, we employ this approach to address the problem of big data exploration and cleaning on small computing clusters. In this paper, we propose RSPExplore, an RSP-based method to explore, validate and clean big data using a few RSP blocks. Given an RSP stored on a computing cluster, this method enables data scientists to estimate the statistical properties of the entire data set while tuning the amount of processed data according to the available resources. Block-level samples of RSP blocks are used to tackle three main tasks: statistical estimation, error detection, and data cleaning. Since RSP blocks are random samples of the same size, the basic principle in this method is to use sample estimates from individual RSP blocks as an approximation of the sampling distribution of the sample estimate. From this sampling distribution, an approximate result for the entire data is obtained with a confidence interval. This principle is employed to estimate the summary statistics of single features and the correlation coefficients between features. Similarly, the error detection problem is addressed by estimating the sampling distribution of the sample proportion of errors, outliers, missing, and valid values from a sample of RSP blocks.
With RSPExplore, a preliminary understanding of the general statistical properties of the data and the potential types of inconsistent values can be obtained early and without processing the entire data set. The estimated statistics and proportions guide the definition of data cleaning rules. These rules are applied in parallel to clean samples of RSP blocks. From the cleaned RSP blocks, summary statistics of the unknown clean data are estimated. The three operations, i.e., estimation, detection, and cleaning, can be repeated either independently or in a data pipeline to improve the results. The experimental results on three real data sets show that estimates from a sample of RSP blocks can rapidly converge toward the true values and that cleaning a sample of dirty RSP blocks is enough to estimate the statistical properties of the unknown clean data.
The main contributions of this paper are as follows:

We propose RSPExplore, a new method for big data exploration and cleaning on computing clusters using the RSP approach;

We address error detection as an estimation problem by estimating the proportions of inconsistent values in quantitative data;

We propose an algorithm to estimate the statistical properties of the entire unknown clean data set by cleaning only a few RSP blocks;

We introduce a theoretical analysis on using RSP blocks for statistical estimation;

We empirically demonstrate the performance of RSPExplore on three real data sets.
The remainder of this paper is organized as follows. In the "Related work" section, we briefly review related work. Then, we describe the Random Sample Partition (RSP) approach in the "Background" section. After that, we propose the RSPExplore method in the "Methods" section. We demonstrate the performance of this method on a small computing cluster in the "Results" section and discuss the implications of this method in the "Discussion" section. Finally, we conclude this paper in the "Conclusions" section.
Related work
In this section, we briefly review recent research directions on exploring and cleaning big data on computing clusters and show the differences from our method.
Sampling-based approximate big data analysis
Sampling-based approximation has become a common approach for big data analysis in cluster computing frameworks. ApproxHadoop [2] uses online multi-stage sampling from HDFS blocks to get approximate results. BlinkDB [45] is a distributed approximate query processing engine that uses offline stratified sampling on frequently occurring columns and uniform sampling to support ad hoc queries. IncApprox [46] uses online stratified sampling to produce an incrementally updated approximate result from streaming data. These frameworks operate directly on the data and compute error bounds without considering data errors. SampleClean [6], on the other hand, combines sampling and cleaning to improve the accuracy of aggregate queries. A random sample of dirty data is created first, and then a data cleaning technique is applied to clean the sample. Next, the cleaned sample is used to estimate the results of aggregate queries. SampleClean supports closed-form estimates based on normal approximation. A query is formulated first as calculating the mean value so that the confidence interval of the result can be estimated according to the Central Limit Theorem. ActiveClean [9] is an iterative data cleaning framework that cleans a small sample of the data to produce a model similar to the one that would be obtained if the full data set were cleaned. It starts with a dirty model, then incrementally cleans a new sample and updates the model. ActiveClean supports convex loss models and uses the model as a guide to identify future data to clean by cleaning those records likely to affect the results.
RSPExplore has several key differences from these works. First, the entire data is stored as ready-to-use disjoint random sample data blocks using the RSP distributed data model. Second, RSPExplore targets offline workloads where data scientists use a variety of techniques to infer global statistical properties from large data volumes. Third, RSPExplore quantifies the uncertainty of the estimated result using the sampling distribution rather than a closed-form approximation. Thus, estimators can be used directly without manual reformulation. RSPExplore can also be applied to predictive modeling tasks where an ensemble model is incrementally built from samples of clean RSP blocks.
Big data exploration and profiling
There are a number of works on extending the mainstream cluster computing frameworks for exploratory data analysis. Optimus^{Footnote 3} implements common data exploration and cleaning operations on Spark DataFrames. Spark DataFrame Profile^{Footnote 4} generates statistics (e.g., descriptive statistics, quantiles, histogram) from Spark DataFrames. Cumulon [47] is an end-to-end system to optimize the cost of calculating statistics on the cloud. Sketch [48] is a system for aggregation on distributed data sets. Data Canopy [49] computes and maintains basic statistics (mean, variance, standard deviation, correlation, and covariance) from fixed chunks of data. While these frameworks apply operations to the entire data set, RSPExplore applies sequential functions to a few RSP blocks without loading the entire data. RSPExplore can be implemented as an extension to these frameworks and the functionalities in these frameworks can be applied to explore and clean RSP blocks. For instance, a Spark DataFrame can be created from a sample of RSP blocks and processed directly with the operations in these frameworks.
Scaling statistical methods to big data
Cluster computing frameworks with a shared-nothing architecture have been adopted to scale iterative algorithms to big data [50]. In addition, to enable data scientists to scale existing statistical methods to big data on computing clusters, libraries were developed in common data science languages such as R and Python. Dask^{Footnote 5} extends common interfaces in Python, such as NumPy and pandas, to distributed environments. Ray DataFrames [23] is a library for exploratory data analysis that uses the same pandas API in Python to run jobs on distributed data sets. RHIPE^{Footnote 6} applies the Divide-and-Recombine approach [51] to reveal deeper insights from distributed data sets by applying R statistical functions in parallel to individual distributed data blocks. However, the exploration and combination of the block-level results becomes a challenge to data scientists due to the large number of distributed data blocks in a big data set and the inconsistency of data distributions in HDFS blocks. In RSPExplore, we solve this problem by first representing the data set as an RSP where each block is a random sample, and then using block-level sampling to process only a few RSP blocks. On the other hand, RSP blocks can be used directly as subsamples in statistical estimation procedures such as the Bag of Little Bootstraps (BLB) [52].
Background
RSP is a new approach that makes the distributed data blocks of a big data set ready-to-use random samples for approximate big data analysis [37]. This approach can be applied to different scenarios where the entire big data set cannot be processed at once, e.g., when the data volume is bigger than the available memory, or the data is stored in multiple data centers or generated in different time windows. In this section, we briefly present the RSP distributed data model and show how this model enables big data computing on small computing clusters.
RSP distributed data model
Assume that \({\mathbb {D}}\) is a multivariate data set of N records and M features where N is very large so \({\mathbb {D}}\) cannot be analyzed on a single machine. With the RSP model, \({\mathbb {D}}\) is divided in advance into K small disjoint random sample data blocks on a computing cluster, called RSP blocks. Assume \({\mathbb {F}}(x)\) is the sample distribution function (SDF) of a random variable x in \({\mathbb {D}}\). \({\mathrm{T}} = \left\{ {\mathrm{D}}_1, {\mathrm{D}}_2, \ldots , {\mathrm{D}}_K \right\}\) is a random sample partition of \({\mathbb {D}}\), with K RSP blocks of n records each, if
\[E\left[ F_k(x) \right] = {\mathbb {F}}(x), \quad k = 1, 2, \ldots , K,\]
where \(F_k(x)\) denotes the sample distribution function of x in \({\mathrm{D}}_k\) and \(E[F_k(x)]\) denotes its expectation. In practice, \({\mathbb {D}}\) is often stored in HDFS. An RSP is generated from an HDFS file of \({\mathbb {D}}\) using the two-stage data partitioning algorithm [42, 43]. Each RSP block is created by combining approximately equal random slices from all the original HDFS blocks. This operation can be scheduled to run offline on the computing cluster, i.e., before starting a data analysis session. \(\mathrm{T}\) is saved as an HDFS-RSP file with metadata \({{\mathrm{T}_{\mathrm{metadata}}}}\) storing RSP block information including the size and location. Selecting an RSP block from \({\mathrm{T}}\) is equivalent to drawing a random sample directly from \({\mathbb {D}}\). Consequently, data scientists can use RSP blocks directly in sampling-based approximate big data analysis.
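The slicing idea behind RSP generation can be illustrated with a small in-memory sketch. This is not the Spark implementation of [42, 43]; the function name `two_stage_partition` and the toy data are assumptions made for illustration only. Each original block is shuffled, cut into K near-equal slices, and slice k is appended to RSP block k.

```python
import random

def two_stage_partition(original_blocks, K, seed=0):
    """Build K RSP-style blocks from the original blocks (in-memory sketch)."""
    rng = random.Random(seed)
    rsp_blocks = [[] for _ in range(K)]
    for block in original_blocks:
        # Stage 1: randomize the record order inside each original block.
        shuffled = list(block)
        rng.shuffle(shuffled)
        # Stage 2: cut the shuffled block into K near-equal slices and
        # append slice k to RSP block k.
        for k in range(K):
            rsp_blocks[k].extend(shuffled[k::K])
    return rsp_blocks

# Toy data: 4 original blocks holding sorted values, so none of them is a
# random sample of the whole data set.
data = list(range(1000))
original = [data[i:i + 250] for i in range(0, 1000, 250)]
rsp = two_stage_partition(original, K=5)
```

In this toy setting every RSP block receives 50 records from each of the 4 original blocks, so each block mixes all value ranges even though the original blocks were sorted.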
Big data computing with RSP blocks
To analyze \({\mathbb {D}}\) with a function f on a computing cluster, only a few RSP blocks from \(\mathrm{T}\) are randomly selected and computed in parallel on their nodes using a sequential implementation of f. This enables small computing clusters to handle data bigger than the memory size by using a stepwise process known as the asymptotic ensemble learning process. This process depends on \({{\mathrm{T}_{\mathrm{metadata}}}}\) and works as follows:

1. A block-level sample with g RSP blocks, \({\mathrm{S}} = \left\{ {\mathrm{D}}_{\tau _1}, {\mathrm{D}}_{\tau _2}, \ldots ,{\mathrm{D}}_{\tau _g} \right\}\), is selected from \({\mathrm{T}}\) without replacement, where \(\tau _1,\tau _2, \ldots , \tau _g\) are the identifiers of the selected RSP blocks;

2. f is applied in parallel to each RSP block in \({\mathrm{S}}\);

3. The outputs from these blocks are combined to produce an approximate result for \({\mathbb {D}}\).
To incrementally improve the results, the previous three steps can be repeated where a new block-level sample of \(\mathrm{T}\) is used each time and the outputs from all the processed RSP blocks are combined into an approximate result for the entire data. This stepwise process can run until a satisfactory result is obtained or all RSP blocks are used up. The RSP approach was used for predictive modeling tasks such as classification and regression [44]. An RSP-based ensemble model built from a sample of RSP blocks is equivalent to a single model built from the entire data. In this paper, we employ the RSP approach for big data exploration and cleaning.
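The stepwise process can be sketched as follows. This is a minimal single-machine illustration: the function name `stepwise_estimate`, the averaging used to combine outputs, and the stopping rule (change between successive combined results below `tolerance`) are assumptions, since the text only requires "a satisfactory result". On a cluster, the loop body would run f on the selected blocks in parallel.

```python
import random
from statistics import mean

def stepwise_estimate(rsp_blocks, f, g, tolerance, seed=0):
    """Process block-level samples of g blocks until the result stabilizes."""
    rng = random.Random(seed)
    remaining = list(range(len(rsp_blocks)))
    rng.shuffle(remaining)  # a random order gives sampling without replacement
    outputs, previous, current = [], None, None
    while remaining:
        # Step 1: select the next block-level sample of (up to) g blocks.
        batch, remaining = remaining[:g], remaining[g:]
        # Step 2: apply f to each selected block.
        outputs.extend(f(rsp_blocks[t]) for t in batch)
        # Step 3: combine the outputs of all blocks processed so far.
        current = mean(outputs)
        if previous is not None and abs(current - previous) <= tolerance:
            break  # assumed stopping rule
        previous = current
    return current, len(outputs)

# Toy RSP: 20 blocks of 500 records from a shuffled synthetic population.
rng = random.Random(1)
population = [rng.gauss(10.0, 2.0) for _ in range(10000)]
rng.shuffle(population)
blocks = [population[k * 500:(k + 1) * 500] for k in range(20)]
estimate, used = stepwise_estimate(blocks, mean, g=5, tolerance=0.05)
```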
Methods
Building on the statistical and computational advantages of the RSP approach, we propose to use this approach to address the problem of big data exploration and cleaning on small computing clusters. In this section, we first introduce the RSPExplore method to explore the statistical properties of a big data set \({\mathbb {D}}\) and handle data errors using samples of RSP blocks. Then, we present a theoretical analysis on using RSP blocks for statistical estimation.
RSPExplore
Given a random sample partition \({\mathrm{T}}\) of a big data set \({\mathbb {D}}\) with K RSP blocks stored as an HDFS-RSP file, the RSPExplore method enables data scientists to quickly explore the general statistical properties of \({\mathbb {D}}\) and define rules for data validation and cleaning. As shown in Fig. 1, this method uses block-level samples from an HDFS-RSP file to address three main tasks: statistical estimation, error detection, and data cleaning.
The Statistics Estimator is used to compute an estimate of a statistic of \({\mathbb {D}}\) within a confidence interval. The Error Detector is used for data validation by estimating the proportions of errors, outliers, missing and valid values in \({\mathbb {D}}\) within a confidence interval. In case of inconsistent values, a sample of dirty RSP blocks is cleaned with the Data Cleaner in order to estimate the statistical properties of the unknown clean data set \({\mathbb {D}}_{clean}\).
Statistics estimator
Let f be an estimator function to compute an estimate \({\hat{\theta }}\) of a statistic \(\theta\) of \({\mathbb {D}}\). f is used to compute a sample estimate from an RSP block. From a block-level sample \(\mathrm{S}\), a set \(\{{\hat{\theta }}_{q}\}^g_{q=1}\) of g sample estimates is computed in parallel as shown in Fig. 2. The sample estimate \({\hat{\theta }}_{q}\) is a random variable since it depends on a particular RSP block \(\mathrm {D_{\tau _q}}\), i.e., a particular random sample. Thus, the sample estimates \(\{{\hat{\theta }}_{q}\}^g_{q=1}\) are used as an approximation of the sampling distribution of \({\hat{\theta }}\). With this sampling distribution, the variability of the sample estimate can be measured and an interval estimate is calculated as an approximate result for the entire data. We apply this estimation method to univariate (mean, standard deviation, median, MAD, skewness, kurtosis) and bivariate estimators (correlation). For these estimators, the mean of the sampling distribution from \({\mathrm{S}}\) is used as an approximate estimate \({{\hat{\theta }}_{\mathrm{S}}} = {\frac{1}{g}}\sum _{q=1}^{g}{{\hat{\theta }}_{q}}\). The Statistics Estimator helps data scientists understand the global statistical properties of the data and define validation rules to explore potential quality issues in the data with the Error Detector.
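A minimal sketch of this estimator is given below, assuming the blocks of a block-level sample are already loaded as Python lists. The names `block_estimates` and `summarize` are hypothetical. The percentile interval follows the 5th and 95th percentiles reported in the experiments; the interpolation method used by `statistics.quantiles` is an implementation choice made here, not something the paper specifies.

```python
import random
from statistics import mean, median, quantiles

def block_estimates(sample_blocks, f):
    """Apply the sequential estimator f to each block (parallel on a cluster)."""
    return [f(block) for block in sample_blocks]

def summarize(estimates, lower_pct=5, upper_pct=95):
    """Return the average estimate and a percentile interval."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(estimates, n=100, method="inclusive")
    return mean(estimates), cuts[lower_pct - 1], cuts[upper_pct - 1]

# Toy block-level sample: g = 20 blocks of 1000 synthetic records each.
rng = random.Random(7)
sample_blocks = [[rng.gauss(50.0, 5.0) for _ in range(1000)] for _ in range(20)]
estimates = block_estimates(sample_blocks, median)
theta_hat, ci_low, ci_high = summarize(estimates)
```

Any sequential estimator with the same call shape (for example `statistics.stdev`) could be passed as f without changing the surrounding code, which is the point of approximating the sampling distribution empirically.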
Error detector
We address the error detection task in quantitative data as an estimation problem. Let \(X=\{x_1, x_2, \ldots , x_N\}\) be a random variable defined over a domain of values \({\mathbb {X}}\) and represented as a feature in \({\mathbb {D}}\). We categorize the values of X in four categories: error, missing, outliers, and valid values:

Error values: values that don’t belong to the data type of X (e.g., a negative value in a power consumption feature);

Outlier values: any value that is not an error and is located more than a specific threshold away from the center of the data. Since the mean and standard deviation are sensitive to outliers, we use the median as a robust metric of location instead of the mean, and the Median Absolute Deviation (MAD) as a robust metric of dispersion instead of the standard deviation [16, 53]. The MAD measures the median distance of all the values from the median. The outlier threshold is then defined as \(a \times MAD\) away from the median where a is often set to 2, 2.5, or 3.

Missing values are often represented using a special string, e.g., NA;

Valid values don’t belong to any of the previous categories.
The proportion of each category is estimated in a similar way to estimating the statistics of \({\mathbb {D}}\). Let h be a predicate function such that \(h(x_i)=1\) whenever \(x_i\) satisfies a given predicate, corresponding to one of the previous categories, and \(h(x_i)=0\) otherwise for \(1 \le i \le N\). The estimator function f is defined to calculate the proportion of values that satisfy the target predicate. The proportion of values of X that satisfy the predicate in \({\mathbb {D}}\) is \(p={\frac{1}{N}}\sum _{i=1}^{N}{h(x_i)}\) and the proportion of values that satisfy the predicate in an RSP block \(\mathrm {D_{\tau _q}}\) is \({\hat{p}}_{q}={\frac{1}{n}}\sum _{j=1}^{n}{h(x_j)}\) for \(1 \le q \le g\). From a block-level sample \(\mathrm{S}\), g sample proportions are calculated in parallel and used as an approximation of the sampling distribution of the sample proportion. Then, the proportion of values that satisfy the predicate in \(\mathrm{S}\) is \({\hat{p}_{\mathrm{S}}} = {\frac{1}{g}}\sum _{q=1}^{g}{\hat{p}_{q}}\). This average value is returned as an approximation of the true proportion p with a confidence interval. For each of the four categories of values, a predicate is defined to get the approximate proportion from a sample of RSP blocks. The Error Detector can alleviate a critical issue when analyzing big data by estimating the proportion of inconsistent values without processing the entire data or directly running expensive cleaning operations. The estimated proportions guide the definition of rules to clean the data with the Data Cleaner.
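The four predicates and the averaged proportions can be sketched as follows. The concrete rules are assumptions for illustration: missing values are represented by `None`, negative values count as errors (as in the power consumption example), and the median and MAD are supplied by the caller, for instance from a previous run of the Statistics Estimator. All function names are hypothetical.

```python
from statistics import mean

CATEGORIES = ("missing", "error", "outlier", "valid")

def categorize(x, med, mad, a=3.0):
    """Assign one value to exactly one of the four categories."""
    if x is None:
        return "missing"
    if x < 0:                      # assumed domain rule: negatives are errors
        return "error"
    if abs(x - med) > a * mad:     # more than a * MAD away from the median
        return "outlier"
    return "valid"

def block_proportions(block, med, mad, a=3.0):
    """Proportion of each category within one RSP block."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for x in block:
        counts[categorize(x, med, mad, a)] += 1
    return {c: counts[c] / len(block) for c in CATEGORIES}

def estimate_proportions(sample_blocks, med, mad, a=3.0):
    """Average the block-level proportions over a block-level sample."""
    per_block = [block_proportions(b, med, mad, a) for b in sample_blocks]
    return {c: mean(p[c] for p in per_block) for c in CATEGORIES}

blocks = [[1.0, 2.0, None, -5.0], [2.0, 100.0, 3.0, 2.5]]
proportions = estimate_proportions(blocks, med=2.0, mad=1.0)
```

The per-block proportions can also be kept and summarized with percentiles to attach a confidence interval to each averaged proportion.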
Data cleaner
Let C be a function that applies transformation and imputation operations to repair inconsistent values. In practice, C can be any data cleaning technique. A cleaning rule is defined for each category of the inconsistent values. It determines whether to delete or replace the inconsistent values and what the replacing value is, e.g., mean or median. The Data Cleaner enables data scientists to apply C in parallel to each RSP block in \({\mathrm{S}}\) and produce a set of cleaned RSP blocks \({\mathrm{S}_{\mathrm{clean}}} = \left\{ {{{\mathrm{D}'}_{\tau _1}},} \right. {{\mathrm{D}'}_{\tau _2}}, \ldots ,\left. {{{\mathrm{D}'}_{\tau _g}}} \right\}\). These cleaned RSP blocks can be saved in HDFS and the metadata of their locations are recorded in \({{\mathrm{T}_{\mathrm{metadata}}}}\). For each RSP block, the cleaning status and the location of the cleaned block are maintained so that these blocks can be used to estimate the statistical properties of the unknown clean data set \({\mathbb {D}}_{clean}\) with the Statistics Estimator.
As RSP blocks of a clean big data set \({\mathbb {D}}\) are random samples of \({\mathbb {D}}\), cleaned RSP blocks of \({\mathbb {D}}\) are potential representative samples of \({\mathbb {D}}_{clean}\). In this paper, we argue that cleaning a sample of dirty RSP blocks from \({\mathrm{T}}\) is sufficient to estimate the statistical properties of the unknown clean data set \({\mathbb {D}}_{clean}\). Instead of cleaning the entire data, which is often not necessary for data exploration, the Data Cleaner and Statistics Estimator are used in a pipeline to explore the statistical properties of \({\mathbb {D}}_{clean}\) as shown in Fig. 3. Each RSP block \({{\rm D}_{\tau _{\rm q}}}\) is processed first with C to get a clean RSP block \({{\rm D}'_{\tau _{\rm q}}}\) for \(1 \le q \le g\). Then, f is applied to get an estimate \({\hat{\theta }}'_q\) from \({{\rm D}'_{\tau _{\rm q}}}\). In this case, the Estimator uses the cleaned RSP blocks to get an average estimate with a confidence interval. This estimate in expectation is equal to the true statistic \({\theta }'\) from the unknown entire clean data \({\mathbb {D}}_{clean}\). Algorithm 1 shows the basic operations to clean block-level samples of dirty RSP blocks and calculate estimates from the cleaned RSP blocks.
In addition to the parameters of the target function, e.g., an estimator, a detector, or a cleaner, extra parameters are required including the number of RSP blocks g in a block-level sample. This can be set according to the number of available cores in the computing cluster to guarantee parallel processing of the selected blocks. In practice, any of these operations can be repeated with multiple samples of RSP blocks in a stepwise process. Furthermore, a data pipeline of these operations can be defined to explore and clean the data.
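The clean-then-estimate pipeline can be sketched as below. This is not Algorithm 1 itself: the cleaning rule (drop missing and negative values, replace outliers by the median) is only one hypothetical choice of C, and the names `clean_block` and `clean_and_estimate` are assumptions.

```python
from statistics import mean

def clean_block(block, med, mad, a=3.0):
    """Hypothetical cleaning function C applied to one RSP block."""
    cleaned = []
    for x in block:
        if x is None or x < 0:
            continue  # assumed rule: delete missing and error values
        # Assumed rule: replace outliers by the (estimated) median.
        cleaned.append(med if abs(x - med) > a * mad else x)
    return cleaned

def clean_and_estimate(sample_blocks, f, med, mad, a=3.0):
    """Clean each block, apply f to each cleaned block, average the estimates."""
    cleaned_blocks = [clean_block(b, med, mad, a) for b in sample_blocks]
    estimates = [f(b) for b in cleaned_blocks]
    return cleaned_blocks, mean(estimates)

blocks = [[1.0, 2.0, None, -5.0], [2.0, 100.0, 3.0, 2.5]]
cleaned, theta_clean = clean_and_estimate(blocks, mean, med=2.0, mad=1.0)
```

Returning the cleaned blocks alongside the estimate mirrors the idea of saving cleaned RSP blocks so that later estimation steps can reuse them.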
Theoretical analysis
We introduce a theoretical analysis on using RSP blocks for statistical estimation. First, we define and prove Lemma 1 which states that any record in \({\mathbb {D}}\) has the same probability to be assigned to any RSP block. Then, we prove that RSP estimates (e.g., mean) are unbiased and consistent estimates in Corollary 1.
Lemma 1
For any RSP block \({{\mathrm{D}}_k} = \left\{ {{\mathrm{x}}_1^{\left( k \right) }} , {\mathrm{x}}_2^{\left( k \right) }, \ldots , {{\mathrm{x}}_{{N_k}}^{\left( k \right) }} \right\}\), \(k \in \left\{ {1,2, \ldots ,K} \right\}\) belonging to the RSP \({\mathrm{T}} = \left\{ {{{\mathrm{D}}_1},} {{\mathrm{D}}_2}, \ldots , {{{\mathrm{D}}_K}} \right\}\) of big data set \({\mathbb {D}} = \left\{ {{{\mathrm{x}}_1},{{\mathrm{x}}_2}, \ldots ,{{\mathrm{x}}_N}} \right\}\), \(P\left\{ {{{\mathrm{x}}_i} \in {{\mathrm{D}}_k}} \right\} = \frac{{{N_k}}}{N}\) and \(P\left\{ { {\mathrm{x}}_j^{\left( k \right) } = {{\mathrm{x}}_i}} \right\} = \frac{1}{N}\) hold for any \(i \in \left\{ {1,2, \ldots ,N} \right\}\) and \(j \in \left\{ {1,2, \ldots ,{N_k}} \right\}\).
Proof
There are \(C_N^{{N_k}}\) possible ways of choosing a subset of \(N_k\) records from \({\mathbb {D}}\). When \({{\mathrm{x}}_i} \in {{\mathrm{D}}_k}\), \(i \in \left\{ {1,2, \ldots ,N} \right\}\), there are \(C_{N - 1}^{{N_k} - 1}\) ways of drawing the remaining \({N_k} - 1\) records for RSP block \({{\mathrm{D}}_k}\). Hence,
\[P\left\{ {{\mathrm{x}}_i} \in {{\mathrm{D}}_k} \right\} = \frac{C_{N - 1}^{{N_k} - 1}}{C_N^{{N_k}}} = \frac{{N_k}}{N}.\]
Furthermore, the \(N_k\) data positions of RSP block \({{\mathrm{D}}_k}\) are equally likely to take \({{{\mathrm{x}}_i}}\). Thus,
\[P\left\{ {\mathrm{x}}_j^{\left( k \right) } = {{\mathrm{x}}_i} \right\} = \frac{{N_k}}{N} \times \frac{1}{{N_k}} = \frac{1}{N}\]
can be derived for any \(j \in \left\{ {1,2, \cdots ,{N_k}} \right\}\). \(\square\)
Corollary 1
For any RSP block \({{\mathrm{D}}_k} = \left\{ {{\mathrm{x}}_1^{\left( k \right) }} \right. , {\mathrm{x}}_2^{\left( k \right) }, \ldots , \left. {{\mathrm{x}}_{{N_k}}^{\left( k \right) }} \right\}\), \(k \in \left\{ {1,2, \ldots ,K} \right\}\) belonging to the RSP \({\mathrm{T}} = \left\{ {{{\mathrm{D}}_1},} \right. {{\mathrm{D}}_2}, \ldots ,\left. {{{\mathrm{D}}_K}} \right\}\) of big data set \({\mathbb {D}} = \left\{ {{{\mathrm{x}}_1},{{\mathrm{x}}_2}, \ldots ,{{\mathrm{x}}_N}} \right\}\), the mean \({\mu ^{\left( k \right) }} = \frac{1}{{{N_k}}}\sum \nolimits _{j = 1}^{{N_k}} {{\mathrm{x}}_j^{\left( k \right) }}\) is an unbiased estimator of mean \(\mu = \frac{1}{N}\sum \nolimits _{i = 1}^N {{{\mathrm{x}}_i}}\) of big data set \({\mathbb {D}}\).
Proof
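From Lemma 1, we can get \(P\left\{ { {\mathrm{x}}_j^{\left( k \right) } = {{\mathrm{x}}_i} } \right\} = \frac{1}{N}\) for any \(i \in \left\{ {1,2, \ldots ,N} \right\}\) and \(j \in \left\{ {1,2, \ldots ,{N_k}} \right\}\). Hence, we can derive:
\[E\left[ \mu ^{\left( k \right) } \right] = \frac{1}{{N_k}}\sum \nolimits _{j = 1}^{{N_k}} E\left[ {\mathrm{x}}_j^{\left( k \right) } \right] = \frac{1}{{N_k}}\sum \nolimits _{j = 1}^{{N_k}} \sum \nolimits _{i = 1}^{N} {{\mathrm{x}}_i} P\left\{ {\mathrm{x}}_j^{\left( k \right) } = {{\mathrm{x}}_i} \right\} = \frac{1}{{N_k}}\sum \nolimits _{j = 1}^{{N_k}} \frac{1}{N}\sum \nolimits _{i = 1}^{N} {{\mathrm{x}}_i} = \mu .\]
It indicates that \({{\mu ^{\left( k \right) }}}\) is an unbiased estimator of mean \(\mu\). \(\square\)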
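Based on Corollary 1, we can further derive that \({{\mu ^{\left( k \right) }}}\) is a consistent estimator of mean \(\mu\). The variance of mean \({{\mu ^{\left( k \right) }}}\), written here as \(\mathrm{Var}\left( \mu ^{\left( k \right) } \right)\), can be expressed as Eq. (4). According to Chebyshev’s inequality, we can get that
\[P\left\{ \left| \mu ^{\left( k \right) } - \mu \right| \ge \varepsilon \right\} \le \frac{\mathrm{Var}\left( \mu ^{\left( k \right) } \right) }{\varepsilon ^2}\]
holds for any given \(\varepsilon > 0\). Furthermore, the inequality can be transformed to
\[P\left\{ \left| \mu ^{\left( k \right) } - \mu \right| < \varepsilon \right\} \ge 1 - \frac{\mathrm{Var}\left( \mu ^{\left( k \right) } \right) }{\varepsilon ^2}.\]
When the mean and variance of the population exist, the variance \({\sigma ^2} = \frac{{\sum \nolimits _{i = 1}^N {{{\left( {{{\mathrm{x}}_i} - \mu } \right) }^2}} }}{{N - 1}}\) of \({\mathbb {D}}\) is an unbiased estimator of the population variance. It is reasonable to assume that \({\sigma ^2}\) is bounded. Hence, we can derive
\[\lim _{{N_k} \rightarrow \infty } \mathrm{Var}\left( \mu ^{\left( k \right) } \right) = 0\]
and
\[\lim _{{N_k} \rightarrow \infty } P\left\{ \left| \mu ^{\left( k \right) } - \mu \right| < \varepsilon \right\} = 1.\]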
In conclusion, we prove that the mean of an RSP block \({{{\mathrm{D}}_k}}\) is an unbiased and consistent estimator of the mean of \({\mathbb {D}}\). As a result, RSP blocks can be used to get an approximate sampling distribution of a sample statistic and get an average estimate within a confidence interval.
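These two properties can be checked numerically on a toy population. The sketch below is an illustration under stated assumptions, not part of the proof: a plain random shuffle stands in for RSP generation, the population is synthetic, and `random_partition` is a hypothetical helper. With equal-sized blocks that cover the whole population, the average of the block means coincides with the population mean, and the spread of the block means shrinks as the block size grows.

```python
import random
from statistics import mean, pstdev

def random_partition(population, K, seed=0):
    """Shuffle the population and cut it into K equal-sized blocks."""
    rng = random.Random(seed)
    shuffled = list(population)
    rng.shuffle(shuffled)
    n = len(shuffled) // K
    return [shuffled[k * n:(k + 1) * n] for k in range(K)]

rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(20000)]  # skewed toy data
mu = mean(population)

# Block means for small blocks (n = 100) and large blocks (n = 1000).
small = [mean(b) for b in random_partition(population, K=200)]
large = [mean(b) for b in random_partition(population, K=20)]

bias_small = mean(small) - mu          # essentially zero: unbiasedness
spread_small = pstdev(small)           # variability of means with n = 100
spread_large = pstdev(large)           # smaller variability with n = 1000
```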
Results
In this section, we demonstrate the performance of RSPExplore in univariate (location, variability/dispersion, skew, kurtosis) and bivariate exploration (correlation) from both clean and dirty data.
Experiment environment and settings
We tested RSPExplore on a small computing cluster of 5 nodes with Apache Spark 1.6 and Microsoft R Server 9.1. We used our Spark implementation of the two-stage data partitioning algorithm [42] to generate an HDFS-RSP file from each data set. The characteristics of the three data sets and the RSPs generated for this experiment are listed in Table 1. For statistical estimation of summary statistics and correlation coefficients, we applied existing sequential R functions to a sample of RSP blocks in parallel. Each sample of g RSP blocks was executed as a single Spark job with only one stage of g Spark tasks. On the other hand, we applied the parallelized functions from the RevoScaleR^{Footnote 7} package to the entire data to get the true statistics and compare them with the results from block-level samples of RSP blocks. To quantify the uncertainty of the RSP-based estimates, we used percentiles to calculate the confidence interval of an RSP-based estimate. In this paper, we report the results at the 90% confidence level, i.e., between the 5th and 95th percentiles of the sampling distribution. We also show error bars from multiple runs of the stepwise process to show the variance of the results. For error detection and data cleaning, we implemented the error detector and data cleaner functions as sequential R functions and applied them to a sample of RSP blocks in a similar way to existing R functions. On the other hand, we used the rxDataStep transformation in RevoScaleR to apply validation and cleaning rules to the entire data and get the true proportions and the entire clean data. For this experiment, we configured the cluster to use 96 cores. Thus, to avoid latency in processing RSP blocks, the number of RSP blocks g in a block-level sample should not greatly exceed 96. Since HIGGS data has only 100 RSP blocks, we tested RSPExplore using block-level samples with only \(g=5\) RSP blocks to show how the results change after each sample.
For Power data, we used block-level samples with \(g=100\) RSP blocks.
Exploring HIGGS data
For this experiment, we generated an RSP from HIGGS^{Footnote 8} data with K = 100 RSP blocks and n = 110,000 records in each block. We started the experiment by exploring the data distribution and summary statistics in some RSP blocks. These blocks have distributions as shown in Fig. 4 and there is no significant difference between the summary statistics values from these blocks. We used the Statistics Estimator and the Error Detector to further explore this data. As an example, we consider feature V2 that represents lepton pT in HIGGS data. We summarize the results as follows:
Summary statistics
In Table 2, we compare the true summary statistics (mean, standard deviation, median, MAD, skewness, and kurtosis) of feature V2 with the RSP-based estimates from \(g=15\) RSP blocks. We can observe that there is no significant difference between the RSP-based estimates and the true values. The estimates have small variance and narrow confidence intervals. To show how the RSP-based estimates change when adding more RSP blocks, we ran the Estimator incrementally on block-level samples with 5 RSP blocks until all RSP blocks were used up. We notice that the estimated value converges quickly to the true value and the error range becomes narrower with more RSP blocks as shown in Fig. 5.
Correlation
Table 3 shows the correlation coefficients between V2 and 6 other features in HIGGS data and compares the estimated coefficients from \(g=15\) RSP blocks with the true values. Figure 6 shows how the estimated coefficients change with more RSP blocks. The observations made for the summary statistics also apply to the correlation coefficients.
Error detection
We used the Detector to check whether HIGGS data has inconsistent values. For instance, Table 4 shows that V2 has no errors and no missing values, but there is a small proportion of outliers. In this example, negative values are errors and outliers are values that are more than \(3*MAD\) from the median. Since HIGGS is not very big and there is only a small proportion of outliers, we don’t show the effect of data cleaning in this case. In the following section, we demonstrate this point with Power data.
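A detector of this kind can be sketched for a single block as below. It is an illustration only, not the authors' R implementation. It uses the rule stated for the Power data validation, where outliers are values more than 3*MAD from the median. It also assumes that the MAD is unscaled, that missing values appear as `None` or NaN, and that the median and MAD are computed from the non-negative, non-missing values. The names `mad` and `detect` are hypothetical.

```python
import math
import statistics


def mad(values):
    """Unscaled median absolute deviation (no consistency constant applied)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def detect(block, k=3.0):
    """Return the proportions of missing values, errors and outliers in a block."""
    n = len(block)
    present = [v for v in block if v is not None and not math.isnan(v)]
    missing = n - len(present)
    errors = sum(1 for v in present if v < 0)  # rule: negative values are errors
    valid = [v for v in present if v >= 0]
    outliers = 0
    if valid:
        med = statistics.median(valid)
        spread = mad(valid)
        # rule: outliers are more than k * MAD away from the median
        outliers = sum(1 for v in valid if abs(v - med) > k * spread)
    return {"missing": missing / n, "errors": errors / n, "outliers": outliers / n}


block = [1.0, 2.0, 2.0, 3.0, 2.5, 1.5, 100.0, -4.0, None, float("nan")]
proportions = detect(block)
```

Averaging these proportions over the blocks of a block-level sample gives the kind of estimate reported in the error detection tables.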
Exploring Power data
Power data was extracted from a smart grid database of an industrial area in Guangdong province. This data contains 46,669,266 records with 98 features: smart meter identifier, date, and the remaining 96 features represent power consumption every 15 minutes in a day. For this experiment, we created an RSP with \(K=6667\) RSP blocks and nearly \(n=7000\) records in each RSP block.
We started the experiment by exploring the data distribution and summary statistics in some RSP blocks. Figure 7 shows the density plots of V38 in 4 RSP blocks. In these plots, we limit the maximum value of V38 to 50,000 to show the distribution clearly. We can see that there is no significant difference between the mean values from different RSP blocks. However, the standard deviation values are not consistent. The standard deviation is very high when a block contains extreme values, such as blocks 1528 and 2569 in Fig. 7. To investigate this data further, we used block-level samples with \(g=100\) RSP blocks in the estimator, detector and cleaner. In this paper, we show the results for only one feature, V38. We summarize our findings as follows:
Summary statistics
Table 5 compares the RSP-based statistics of V38 from 100 RSP blocks and the true values from the entire data. While the estimated median and MAD are close to the true values and have narrow confidence intervals, the other statistics differ significantly from the true values and have wide confidence intervals. This is expected given the high standard deviation of V38 in some RSP blocks due to extreme values. To show how the RSP-based estimates from this dirty Power data change when adding more RSP blocks, we ran the Estimator incrementally on block-level samples with 100 RSP blocks until all RSP blocks were used up. Figure 8 shows the results from 25 runs of this process for each statistic. The estimates of the mean, median and MAD converge quickly to the true values. In the case of the standard deviation, skewness, and kurtosis, the estimate converges to a value that differs significantly from the true value.
Correlation
As the features in this data represent power consumption at certain time points in a day, we can expect that each feature has higher correlation with nearby features. We used the Estimator to explore the correlation coefficients between features. For instance, Table 6 shows the RSP-based coefficients between V38 and 6 other features. To show how the correlation coefficients change with more data, we ran the Estimator incrementally to compute the correlation coefficients from block-level samples with 100 RSP blocks until all the blocks were used up. The plots in Fig. 9 show the results for the correlation between V38 and 6 other features. As with the RSP-based summary statistics, we notice that the estimated correlation coefficients converge quickly and adding more data doesn’t change the values significantly. However, the RSP-based coefficients are not so close to the values from the entire data. Since Pearson’s correlation coefficient depends on the standard deviation, it is sensitive to outliers.
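The sensitivity of Pearson's coefficient to extreme values can be seen in a small synthetic example. The data below are made up for illustration and are not the Power data. The helper `pearson` is a plain implementation of the textbook formula, and the corruption of five readings with the value 1e6 is an arbitrary assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Two strongly related synthetic features.
random.seed(2)
x = [random.gauss(100.0, 10.0) for _ in range(5000)]
y = [a + random.gauss(0.0, 5.0) for a in x]
r_clean = pearson(x, y)

# Corrupt only five readings of x with an extreme value.
x_dirty = list(x)
for i in range(5):
    x_dirty[i] = 1e6
r_dirty = pearson(x_dirty, y)
```

Here a handful of extreme readings inflates the standard deviation of one feature and drives the coefficient toward zero, which is consistent with the behaviour observed on the dirty data.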
Error detection
Summary statistics from a block-level sample of RSP blocks revealed that Power data has inconsistent values. To investigate these inconsistent values further, we defined three validation rules on each of the power consumption features: (1) negative values are considered errors, (2) outliers are values that are more than \(3*MAD\) from the median, and (3) missing values are represented as NAs. We estimated the average proportions from block-level samples with \(g=100\) RSP blocks and updated the results incrementally. Figure 10 shows that the estimated proportions don’t vary significantly with more RSP blocks. In fact, the average proportions from 100 RSP blocks are comparable to those computed from the entire data as shown in Table 7. As with the summary statistics, incremental proportions with error ranges are shown in Fig. 11.
Data cleaning
We applied the same cleaning rule to the three categories: errors, outliers and missing values. Since the proportion of these values is high, we replaced them with the median value. We used the Data Cleaner to apply this rule to only the first block-level sample of \(g=100\) RSP blocks and recomputed the summary statistics and correlation coefficients from the cleaned blocks. For comparison, we also applied the same rule to the entire Power data and recomputed the true values. Table 8 shows the summary statistics of V38 from the cleaned RSP blocks and the true statistics from the entire clean data. Similarly, Table 9 shows the correlation coefficients between V38 and the 6 other features in the cleaned data. To show how the estimated values change with more data, we ran Algorithm 1 incrementally using block-level samples with \(g=100\) RSP blocks. Figure 12 shows the results from 25 runs of this process for each statistic and Fig. 13 shows the results for the correlation between V38 and the 6 other features. The estimated values from clean data have the expected characteristics of RSP-based estimates, e.g., they converge quickly with small error bounds. This shows the effect of data cleaning and that only a few clean RSP blocks are sufficient to explore the statistical properties of the entire unknown clean data.
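The median-replacement rule can be sketched for one block as follows. This is an illustration under stated assumptions, not the authors' Data Cleaner. It assumes that the replacement value is the median of the valid values within the same block, that the MAD is unscaled, and that missing values appear as `None` or NaN. The function name `clean_block` is hypothetical.

```python
import math
import statistics


def clean_block(block, k=3.0):
    """Replace errors, outliers and missing values with the block median.

    Assumption: the median and MAD are computed per block from the
    non-missing, non-negative values.
    """
    def is_missing(v):
        return v is None or math.isnan(v)

    valid = [v for v in block if not is_missing(v) and v >= 0]
    if not valid:
        return list(block)  # nothing to base a replacement value on
    med = statistics.median(valid)
    spread = statistics.median([abs(v - med) for v in valid])
    cleaned = []
    for v in block:
        if is_missing(v) or v < 0 or abs(v - med) > k * spread:
            cleaned.append(med)
        else:
            cleaned.append(v)
    return cleaned


block = [1.0, 2.0, 2.0, 3.0, 2.5, 1.5, 100.0, -4.0, None, float("nan")]
cleaned = clean_block(block)
```

Since the function touches one block at a time, it can be applied to a block-level sample in parallel and the summary statistics can then be recomputed on the cleaned blocks.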
Exploring airlines data
For this experiment, we generated an RSP from the Airlines performance data, AirOnTime87to12^{Footnote 9}, with \(K=298\) RSP blocks and \(n=500,000\) records in each block. Then, we used RSPExplore to explore different features related to flight delays, such as ArrDelay. Figure 14 shows the density plots of the ArrDelay feature in four RSP blocks. The distributions in these blocks are similar. We used a block-level sample with only \(g=10\) RSP blocks to estimate the statistical properties of the entire data. Table 10 compares the true statistics with the RSP-based estimates. Similarly, the true correlation coefficients between ArrDelay and 6 other features in the data are compared with the RSP-based coefficients in Table 11. We also used the Error Detector to estimate the proportions of inconsistent values in ArrDelay. As shown in Table 12, this feature has a small proportion of outliers and missing values. From the previous tables we can see that 10 RSP blocks of the AirOnTime87to12 data can be used effectively to get summary statistics and proportions with narrow confidence intervals. These approximate results help data scientists decide how the data should be cleaned, in a similar way to Power data.
Discussion
In the previous experiments, we used three real data sets to demonstrate the statistical advantages of the RSPExplore method. We can see that a few RSP blocks are sufficient to explore both clean and dirty data (less than 15% of HIGGS, 1.5% of Power, and 3.5% of AirOnTime87to12). On average, the processing time is reduced from minutes to seconds for these data sets on our small computing cluster with 5 nodes. Since data scientists iteratively apply a variety of techniques to explore a data set, the RSPExplore method can have significant implications for the scalability and efficiency of big data exploration tasks and the productivity of data scientists. This is mainly due to the underlying RSP data model, which facilitates online random sampling from big data in cluster computing frameworks [37].
In this paper, we demonstrate the performance of RSPExplore on numerical data. However, the same method can be used with categorical data. For instance, a block-level sample of RSP blocks can be used to estimate the relative frequencies of the categories in a categorical feature. Similarly, the proportion of erroneous categorical values can be estimated in the same way as we estimate the proportions of inconsistent values in numerical data. Furthermore, RSPExplore can be extended directly to support logical data cleaning tasks such as those discussed in [15, 54]. A block-level sample can be used to estimate the proportion of records that don’t satisfy a certain constraint or the proportion of values that are slightly different from the correct value. In this case, the estimated proportions can help data scientists decide on the repairing strategy or whether to tolerate small differences in the values. Since RSPExplore can work on small batches of RSP blocks, it can also be employed for interactive data cleaning where data scientists correct the violations of integrity constraints in a block-level sample of RSP blocks. This is necessary for human-in-the-loop data analysis [55,56,57,58].
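The categorical case can be sketched in the same spirit. This example is not part of the reported experiments. It assumes blocks of roughly equal size, so that averaging the per-block relative frequencies is a reasonable estimate of the global ones. The function names and the toy blocks are hypothetical.

```python
from collections import Counter


def relative_frequencies(block):
    """Relative frequency of each category within one block."""
    counts = Counter(block)
    n = len(block)
    return {category: count / n for category, count in counts.items()}


def estimate_frequencies(blocks):
    """Average the per-block relative frequencies over a block-level sample.

    Assumption: blocks have (nearly) equal sizes, so an unweighted average
    is appropriate. A category absent from a block counts as frequency 0.
    """
    per_block = [relative_frequencies(block) for block in blocks]
    categories = set().union(*per_block)
    return {
        category: sum(freqs.get(category, 0.0) for freqs in per_block) / len(per_block)
        for category in sorted(categories)
    }


blocks = [
    ["a", "a", "b", "c"],
    ["a", "b", "b", "c"],
    ["a", "a", "a", "b"],
]
estimated = estimate_frequencies(blocks)
```

The same pattern applies to the proportion of records violating a constraint: replace the category counter with an indicator of whether each record satisfies the rule.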
In principle, this method can be applied to any multivariate data set where objects are represented by one or more features and stored in a tabular format. This is a common form for representing different types of data whether the source data is structured, semi-structured or unstructured. For instance, a corpus of text data is usually represented in a document-term matrix that is similar to a data frame with records representing the documents and features representing the terms. This matrix can be stored as an RSP. Then, the RSPExplore method can be applied, for instance, to estimate the distribution of a term frequency in the entire corpus using only a block-level sample of RSP blocks. This can be used to identify potential stop words in the corpus.
As we demonstrated in this paper, RSPExplore is a good method to quickly understand the global statistical properties of a big data set using existing sequential or user-defined functions. The basic statistics and proportions are estimated from a block-level sample without processing the entire data. The number of RSP blocks g in a block-level sample can be determined according to the available cores in the cluster. Unfortunately, these computational and statistical advantages can’t be obtained directly in some cases:

Currently, RSPExplore can’t be used to get an approximate histogram. While it is possible to get histograms of individual RSP blocks, building an approximate histogram requires criteria for combining local histograms and quantifying the uncertainty of the approximated global histogram. We are currently working on extending RSPExplore to build an approximate equiwidth histogram that can be used to quickly understand the probability distribution of the entire data;

RSPExplore can’t be used directly to detect and repair duplication errors. It needs an additional step to check duplications across RSP blocks. We are currently experimenting with this idea. Furthermore, empirical and theoretical evidence is necessary to study the effect of deduplication on the probability distribution in RSP blocks and the similarity between these blocks and the entire unknown clean data. In fact, the burden of big data cleaning would be dramatically alleviated if repairing duplicates in only a small block-level sample was enough to get samples of the entire unknown clean data.

RSPExplore is not designed for streaming data. As we mentioned before, we target offline workloads where data scientists explore big data sets on computing clusters with a variety of techniques. For streaming data, a different strategy is required to get synopses of the data, such as sketching [59].

If the target is to find statistics or proportions in a certain subspace in the data, we may need alternative data partitioning algorithms to create RSP blocks with specific characteristics (e.g., each block is a random sample of the observations about customers in a certain city or branch). This issue still needs more investigation and experiments.
Conclusions
In this paper, we presented the RSPExplore method for big data exploration and cleaning on small computing clusters. This method addresses three main tasks using the RSP approach: statistical estimation, error detection and data cleaning. With this method, data scientists can tune the amount of processed data according to the available cores in a computing cluster. We demonstrated that a few RSP blocks are enough to explore the statistical properties of a big data set including summary statistics and proportions of inconsistent values. The experimental results of three real data sets show that RSP-based estimates can rapidly converge toward the true values and that cleaning a sample of RSP blocks is enough to estimate the statistical properties of the unknown clean data. Our current and future work includes using the RSP approach for histogram estimation, data visualization, and feature selection. In addition, we are experimenting with alternative data partitioning algorithms to generate RSP blocks on small computing clusters.
Availability of data and materials
The HIGGS dataset analysed during the current study is available in the UCI Machine Learning repository, https://archive.ics.uci.edu/ml/datasets/HIGGS. The AirOnTime87to12 dataset analysed during the current study is available online at https://packages.revolutionanalytics.com/datasets/AirOnTime87to12/. The Power dataset analysed during the current study is not publicly available.
Abbreviations
RSP: random sample partition
HDFS: Hadoop Distributed File System
RLS: record-level sampling
BLS: block-level sampling
SDF: sample distribution function
References
 1.
Nair R. Big data needs approximate computing: technical perspective. Commun ACM. 2014;58(1):104. https://doi.org/10.1145/2688072.
 2.
Goiri Í, Bianchini R, Nagarakatte S, Nguyen TD. Approxhadoop: bringing approximations to mapreduce frameworks. In: Proceedings of the twentieth international conference on architectural support for programming languages and operating systems. ACM; 2015. p. 383–97.
 3.
Salloum S, Huang JZ, He Y. Empirical analysis of asymptotic ensemble learning for big data. In: Proceedings of the 3rd IEEE/ACM international conference on big data computing, applications and technologies. BDCAT ’16, ACM, New York, NY, USA; 2016, p. 8–17. https://doi.org/10.1145/3006299.3006306.
 4.
Kwon BC, Verma J, Haas PJ, Demiralp. Sampling for scalable visual analytics. IEEE Comput Graph Appl. 2017;37(1):100–8. https://doi.org/10.1109/MCG.2017.6.
 5.
Riondato M. Samplingbased data mining algorithms: modern techniques and case studies. In: Proceedings of the 2014th European conference on machine learning and knowledge discovery in databasesvolume part III. ECMLPKDD’14, Berlin, Heidelberg: Springer; 2014, p. 516–9.
 6.
Krishnan S, Wang J, Franklin MJ, Goldberg K, Kraska T, Milo T, Wu E. Sampleclean: fast and reliable analytics on dirty data. IEEE Data Eng Bull. 2015;38(3):59–75.
 7.
Sutton CA, Hobson T, Geddes J, Caruana R. Data diff: interpretable, executable summaries of changes in distributions for data wrangling. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, KDD 2018, London, UK, August 19–23, 2018; 2018. p. 2279–88. https://doi.org/10.1145/3219819.3220057.
 8.
Chu X, Ilyas IF, Krishnan S, Wang J. Data cleaning: overview and emerging challenges. In: Proceedings of the 2016 international conference on management of data. SIGMOD ’16. ACM, New York, NY, USA; 2016, p. 2201–6. https://doi.org/10.1145/2882903.2912574.
 9.
Krishnan S, Wang J, Wu E, Franklin MJ, Goldberg K. Activeclean: interactive data cleaning for statistical modeling. Proc VLDB Endow. 2016;9(12):948–59. https://doi.org/10.14778/2994509.2994514.
 10.
Tukey JW. Exploratory data analysis. Behavioral science: quantitative methods. AddisonWesley, Reading: Mass; 1977.
 11.
Rojas JAR, Kery MB, Rosenthal S, Dey A. Sampling techniques to improve big data exploration. In: 2017 IEEE 7th symposium on large data analysis and visualization (LDAV); 2017, p. 26–35. https://doi.org/10.1109/LDAV.2017.8231848.
 12.
Idreos S, Papaemmanouil O, Chaudhuri S. Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. SIGMOD ’15. ACM, New York, NY, USA; 2015, p. 277–81. https://doi.org/10.1145/2723372.2731084.
 13.
Kandel S, Heer J, Plaisant C, Kennedy J, van Ham F, Riche NH, Weaver C, Lee B, Brodbeck D, Buono P. Research directions in data wrangling: visualizations and transformations for usable and credible data. Inf Vis. 2011;10(4):271–88. https://doi.org/10.1177/1473871611415994.
 14.
Dasu T, Johnson T. Exploratory data mining and data cleaning. 1st ed. New York: Wiley; 2003.
 15.
Abedjan Z, Chu X, Deng D, Fernandez RC, Ilyas IF, Ouzzani M, Papotti P, Stonebraker M, Tang N. Detecting data errors: where are we and what needs to be done? Proc VLDB Endow. 2016;9(12):993–1004. https://doi.org/10.14778/2994509.2994518.
 16.
Hellerstein JM. Quantitative data cleaning for large databases; 2008.
 17.
Krishnan S, Haas D, Franklin MJ, Wu E. Towards reliable interactive data cleaning: a user survey and recommendations. In: Proceedings of the workshop on humanintheloop data analytics. HILDA ’16. ACM, New York, NY, USA; 2016, p. 9–195. https://doi.org/10.1145/2939502.2939511.
 18.
Polyzotis N, Roy S, Whang SE, Zinkevich M. Data lifecycle challenges in production machine learning: a survey. SIGMOD Rec. 2018;47(2):17–28. https://doi.org/10.1145/3299887.3299891.
 19.
Ilyas IF, Chu X. Trends in cleaning relational data: consistency and deduplication. Found Trends Databases. 2015;5(4):281–393. https://doi.org/10.1561/1900000045.
 20.
Park Y, Cafarella MJ, Mozafari B. Visualizationaware sampling for very large databases; 2015. CoRR arXiv:abs/1510.03921.
 21.
Fisher D. Big data exploration requires collaboration between visualization and data infrastructures. In: HILDA ’16 proceedings of the workshop on humanintheloop data analytics, San Francisco, California, 26 June–1 July 2016. New York: ACM; 2016. https://dl.acm.org/citation.cfm?id=2939518.
 22.
Wang Y, Zhong Y, Ma Q, Yang G. Distributed and parallel construction method for equiwidth histogram in cloud database. Multiagent Grid Syst. 2017;13(3):311–29.
 23.
Yang P. Ray dataframes: a library for parallel data analysis. Master’s thesis, EECS Department, University of California, Berkeley; May 2018. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS201884.html. Accessed 15 Jan 2019.
 24.
Wickham H, Grolemund G. R for data science: import, tidy, transform, visualize, and model data. 1st ed. Sebastopol: O’Reilly Media Inc; 2017.
 25.
McKinney W. Python for data analysis. 1st ed. Sebastopol: O’Reilly Media Inc; 2012.
 26.
Chong FT, Heck MJR, Ranganathan P, Saleh AAM, Wassel HMG. Data center energy efficiency: improving energy efficiency in data centers beyond technology scaling. IEEE Design Test. 2014;31(1):93–104. https://doi.org/10.1109/MDAT.2013.2294466.
 27.
Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST); 2010, p. 1–10. https://doi.org/10.1109/MSST.2010.5496972.
 28.
Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13. https://doi.org/10.1145/1327452.132749.
 29.
Zaharia M, Chowdhury M, Das T, Dave A. Resilient distributed datasets: a faulttolerant abstraction for inmemory cluster computing. In: NSDI’12 Proceedings of the 9th USENIX conference on networked systems design and implementation; 2012, p. 2. https://doi.org/10.1111/j.10958649.2005.00662.x.
 30.
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I. Apache spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65. https://doi.org/10.1145/2934664.
 31.
Salloum S, Dautov R, Chen X, Peng PX, Huang JZ. Big data analytics on apache spark. Int J Data Sci Anal. 2016;1(3):145–64. https://doi.org/10.1007/s4106001600279.
 32.
Bengfort B, Kim J. Data analytics with Hadoop: an introduction for data scientists. Sebastopol: O’Reilly Media; 2016. https://books.google.com.hk/books?id=ou9FDAAAQBAJ.
 33.
Siuly S, Zhang Y. Medical big data: neurological diseases diagnosis through medical data analysis. Data Sci Eng. 2016;1(2):54–64. https://doi.org/10.1007/s4101901600113.
 34.
Jacobs B. White paper: accelerating R analytics with Spark and Microsoft R Server for Hadoop. White Paper; 2016. https://info.microsoft.com/rs/157GQE382/images/ENCNTNTWhitepaperSparkMicrosoftRServerHadoop.pdf. Accessed 17 May 2019.
 35.
VargasSolar G, ZechinelliMartini JL, EspinosaOviedo JA. Big data management: what to keep from the past to face future challenges? Data Sci Eng. 2017;2(4):328–45. https://doi.org/10.1007/s4101901700433.
 36.
Dolev S, Florissi P, Gudes E, Sharma S, Singer I. A survey on geographically distributed bigdata processing using mapReduce. IEEE Trans Big Data. 2017;99:1. https://doi.org/10.1109/TBDATA.2017.2723473.
 37.
Salloum S, Huang JZ, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Trans Ind Inf. 2019. https://doi.org/10.1109/TII.2019.2912723.
 38.
Ci X, Meng X. In: Dong XL, Yu X, Li J, Sun Y (eds) An efficient block sampling strategy for online aggregation in the cloud. Cham: Springer; 2015. p. 362–73.
 39.
Chaudhuri S, Das G, Srivastava U. Effective use of blocklevel sampling in statistics estimation. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data. SIGMOD ’04. ACM, New York, NY, USA; 2004, p. 287–98. https://doi.org/10.1145/1007568.1007602.
 40.
Kalavri V, Brundza V, Vlassov V. Block sampling: efficient accurate online aggregation in mapreduce. In: 2013 IEEE 5th international conference on cloud computing technology and science, vol. 1; 2013, p. 250–57. https://doi.org/10.1109/CloudCom.2013.40.
 41.
Wang Y, Zhong Y, Ma Q, Yang G. Distributed and parallel construction method for equiwidth histogram in cloud database. Multiagent Grid Syst. 2017;13(3):311–29. https://doi.org/10.3233/MGS170273.
 42.
Wei C, Salloum S, Emara TZ, Zhang X, Huang JZ, He Y. A twostage data processing algorithm to generate random sample partitions for big data analysis. In: Luo M, Zhang LJ, editors. Cloud Computing—CLOUD 2018. Cham: Springer; 2018. p. 347–64.
 43.
Emara TZ, Huang JZ. A distributed data management system to support largescale data analysis. J Syst Softw. 2019;148:105–15. https://doi.org/10.1016/j.jss.2018.11.007.
 44.
Salloum S, Huang JZ, He Y, Chen X. An asymptotic ensemble learning framework for big data analysis. IEEE Access. 2018;. https://doi.org/10.1109/ACCESS.2018.2889355.
 45.
Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica, I. Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European Conference on Computer Systems. ACM; 2013, p. 29–42.
 46.
Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R. Incapprox: A data analytics system for incremental approximate computing. In: Proceedings of the 25th International Conference on World Wide Web. WWW ’16. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland; 2016, p. 1133–44. https://doi.org/10.1145/2872427.2883026.
 47.
Huang B, Babu S, Yang J. Cumulon: optimizing statistical data analysis in the cloud. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data. SIGMOD ’13. ACM, New York, NY, USA; 2013, p. 1–12. https://doi.org/10.1145/2463676.2465273.
 48.
Budiu M, Isaacs R, Murray D, Plotkin G, Barham P, AlKiswany S, Boshmaf Y, Luo Q, Andoni A. Interacting with large distributed datasets using sketch. In: Proceedings of the 16th eurographics symposium on parallel graphics and visualization. EGPGV ’16. Eurographics Association, Goslar Germany, Germany; 2016, p. 31–43. https://doi.org/10.2312/pgv.20161180.
 49.
Wasay A, Wei X, Dayan N, Idreos S. Data canopy: accelerating exploratory statistical analysis. In: Proceedings of the 2017 ACM international conference on management of data. SIGMOD ’17. ACM, New York, NY, USA; 2017, p. 557–72. https://doi.org/10.1145/3035918.3064051.
 50.
Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the hadoop ecosystem. J Big Data. 2015;2(1):1–36. https://doi.org/10.1186/s4053701500321.
 51.
Guha S, Hafen R, Rounds J, Xia J, Li J, Xi B, Cleveland WS. Large complex data: divide and recombine (d&r) with rhipe. Stat. 2012;1(1):53–67. https://doi.org/10.1002/sta4.7.
 52.
Kleiner A, Talwalkar A, Sarkar P, Jordan MI. A scalable bootstrap for massive data. J R Stat Soc Series B Stat Methodol. 2014;76(4):795–816. https://doi.org/10.1111/rssb.12050.
 53.
Leys C, Ley C, Klein O, Bernard P, Licata L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psychol. 2013;49(4):764–6. https://doi.org/10.1016/j.jesp.2013.03.013.
 54.
Prokoshyna N, Szlichta J, Chiang F, Miller RJ, Srivastava D. Combining quantitative and logical data cleaning. Proc VLDB Endow. 2015;9(4):300–11. https://doi.org/10.14778/2856318.2856325.
 55.
Rezig EK, Ouzzani M, Elmagarmid AK, Aref WG. Humancentric data cleaning [vision]. 2017. CoRR arXiv:abs/1712.08971.
 56.
Doan, A. Humanintheloop data analysis: A personal perspective. In: Proceedings of the workshop on humanintheloop data analytics. HILDA’18. ACM, New York, NY, USA; 2018, p. 1–116. https://doi.org/10.1145/3209900.3209913.
 57.
Liu J, Wilson A, Gunning D. Workflowbased humanintheloop data analytics. In: Proceedings of the 2014 workshop on human centered big data research. HCBDR ’14. ACM, New York, NY, USA; 2014, p. 49–494952. https://doi.org/10.1145/2609876.2609888.
 58.
Anderson MR, Antenucci D, Cafarella MJ. Runtime support for humanintheloop feature engineering system. IEEE Data Eng Bull. 2016;39(4):62–84.
 59.
Cormode G. Data sketching. Commun ACM. 2017;60(9):48–55. https://doi.org/10.1145/3080008.
Acknowledgements
We sincerely thank the editors and anonymous reviewers whose valuable comments and suggestions helped us improve this paper significantly.
Funding
This paper was supported by the National Natural Science Foundation of China (61473194), National Key R&D Program of China (2017YFC08226042), China Postdoctoral Science Foundation (2016T90799) and Scientific Research Foundation of Shenzhen University for Newlyintroduced Teachers (2018060).
Author information
Contributions
SS performed the primary work and experiments of this manuscript. JZH proposed the main idea, took on a supervisory role and oversaw the completion of the work. YH did the theoretical analysis. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Yulin He.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Big data
 Exploratory data analysis
 Statistical estimation
 Data cleaning
 Blocklevel sampling
 Random sample partition
 Distributed
 Parallel and cluster computing