
Enhancing correlated big data privacy using differential privacy and machine learning

Abstract

Data are often correlated in real-world datasets, yet existing data privacy algorithms have not treated this correlation as an inherent property of the data. Such correlation causes privacy leakages that most researchers have left unnoticed. These leakages commonly arise from homogeneity, background knowledge, and linkage attacks, and the probability of such attacks increases with the magnitude of correlation among the data. The problem is further magnified by the large size of real-world datasets, which we refer to as 'Big Data'. Several researchers have proposed algorithms that combine machine learning models, correlation analysis, and data privacy algorithms to prevent privacy leakages due to correlation in large datasets. The work proposed here first analyses the correlation among data. We studied the mutual information correlation (MIC) analysis technique and the distance correlation analysis technique and found distance correlation to be more accurate for high-dimensional data. The method then divides the data into blocks using the computed correlation and applies a differential privacy algorithm to meet the data privacy expectations. The results are evaluated against multiple parameters, such as data utility, mean average error, variation with data size, and privacy budget values. They show that the proposed methodology provides better data utility than the works of other researchers, while its data privacy commitments remain comparable to theirs. Thus, the proposed methodology gives better data utility while maintaining the required data privacy commitments.

Introduction

The massive generation of data from our day-to-day lives produces large, voluminous, and heterogeneous data daily. As a result, real-world datasets are typically large and high-dimensional. Traditional privacy algorithms are no longer sufficient to ensure the privacy of such datasets, especially when the data is highly correlated [1, 2]. Hence, many researchers are working on algorithms that address these challenges. Our previous work [4] gave a detailed account of the global research that has shed light on this issue and proposed solutions for it. Among these, we identified the work of Lv et al. [5] as the most promising and extended our research in the same direction. This paper presents a solution based on the distance correlation analysis technique and showcases our results. We compared our results with those of [5] and demonstrated the advantages of our method.

Data privacy protection

Data privacy has been a topic of concern for a long time, and the concern grew among researchers as data increased in size and dimensionality. The classical data privacy algorithms are k-anonymity [6, 7], l-diversity [7, 8], t-closeness [9], and differential privacy (DP) [10, 11]. Figure 1 briefly describes these traditional algorithms. Most data privacy research revolves around DP, owing to its widespread use in the field. Researchers have observed and studied many threats to data privacy over the years; among them, correlation within the data is one of the potential causes of privacy leakages [12,13,14].

Fig. 1: Pros and cons of classical data privacy algorithms

Initial research around data privacy ignored data correlation and treated data as IID. Data is said to be IID, i.e., Independent and Identically Distributed, when records hold no relation to one another and follow an identical distribution throughout the dataset; in other words, no correlation exists among the data. If a correlation does exist but is ignored when data privacy algorithms are applied, the IID assumption can lead to potential privacy leakages [5, 15, 16]. This threat increases with the size and dimensionality of the data; hence, data correlation is an even more significant threat to big data [17], whose privacy often gets compromised by ignoring the correlation within it.

The work presented in this paper outlines the related results in which data correlation threatens data privacy and big data privacy, and it suggests a methodology to deal with the problem. The proposed approach first measures the correlation among data using the distance correlation analysis method. Using this correlation as a parameter, the dataset is then clustered and divided into blocks. After that, data sensitivity is calculated with respect to the individual blocks instead of the Global Sensitivity (GS). The last step uses the calculated sensitivity to apply differential privacy to the data blocks. The distance correlation analysis applied in the first step ensures proper recognition of data correlation in the big dataset. The divide and conquer approach handles the high dimensionality of the data, and the block-level sensitivity ensures that noise of lower magnitude is added, increasing data utility. PySpark technology is adopted to manage the processing of big data. Finally, we compare the observations from our experiments with the results of [5] to establish the method's effectiveness.

Data privacy and data utility

Data utility means that the availability of the data should be maintained while conducting privacy preservation [18]. More precisely, it refers to the ease of using noisy data for data mining and other analyses while maintaining the correctness and accuracy of the analytical results drawn from that data [18]. The most common metric for data utility is information loss: the greater the information loss, the lower the utility of the data [18]. Data privacy and data utility are also related: the privacy performance of the data decreases as data utility increases, creating a trade-off between the two. Figure 2 is a graphical representation of this trade-off. To plot the graph, we calculated data privacy as the number of attributes that have been anonymized and measured data utility using information gain. In the presented work, we aim to provide enhanced data utility while maintaining the required data privacy levels; thus, the measurement of data utility is crucial for the proposed methodology.

Fig. 2: Data utility and data privacy trade-off

Data correlation and data privacy

Before 2011, researchers designing data privacy algorithms assumed that no correlation existed among data. In 2011, researchers began studying the potential of data correlation as a privacy threat and gave ample instances to support it [4]. Many privacy leakages became evident in the presence of data correlation, caused by homogeneity, background knowledge, and linkage attacks; the main contributing factor to these attacks was unnoticed data correlation. Ignoring the existence of correlation lets these leakages go unnoticed and reduces the privacy commitments of the data privacy algorithms. The pioneers here were Gehrke et al. [19], who considered the case of social networks, where users and their data are highly correlated, and showed that even the strong privacy guarantee provided by differential privacy could not assure privacy in social network settings. In subsequent years, researchers worked along the same lines with various data to study the privacy threats associated with data correlation and proposed solutions. Table 2 summarizes such works, and the following section provides deeper insight into them.

Another problem associated with data correlation is sensitivity: correlated data yields a higher sensitivity value. When applying a differential privacy algorithm, a higher sensitivity causes larger noise to be added to the original data, which adversely affects data utility. This is an undesirable effect and may render the privatized data useless.

Correlated big data privacy

As stated earlier, data correlation poses a big threat to data privacy, causing leakages that go unnoticed and lead to unexpected compromises. Real-world datasets are often large and high-dimensional, which in turn leads to high data correlation [4] and poses a potential threat to big data privacy. Given the massive amount of data and the mix of structured and unstructured data, new big data models are needed to improve privacy and protection [4].

Organisation of the paper

This paper suggests a mechanism that deals with three main problems: processing high-dimensional big datasets, ensuring privacy protection of correlated data while using DP as the main privacy algorithm, and maintaining data utility while enhancing data privacy. The remainder of the paper is organized as follows: the "Literature review" section presents a brief literature review of the topic; the "Basic principles and theories" section gives a short description of the basic principles and theories used; the "Proposed methodology" section discusses the proposed solution for the privacy protection of correlated big data and presents the algorithm implementing it; the "Experiment and analysis" section describes the experiments performed and presents the analysis and results; and the paper then concludes. Table 1 lists the abbreviations used in this work.

Table 1 List of abbreviations

Literature review

Differential privacy provides very robust, mathematically grounded privacy protection for data. Since 2011, researchers have studied the potential of data correlation as a privacy threat and have given ample instances to support it [4]. Gehrke et al. [19], in 2011, considered the case of social networks, where users and their data are highly correlated, and showed that even the strong privacy guarantee provided by differential privacy could not assure privacy in social network settings. Kifer et al. [13], also in 2011, argued that considering the correlation between records is pivotal, as correlation between records or attributes can substantially decrease the privacy guarantee provided by any algorithm. These were the initial attempts to formalize data correlation as a general phenomenon in real-world datasets and are considered pioneering in recognizing its existence and its potential as a privacy threat. In successive work in 2014, Kifer et al. [21] proposed a privacy framework called Pufferfish, which helped develop privacy definitions for different data-sharing needs, studied existing privacy definitions and the privacy compromise due to non-independent data records, and addressed several other critical privacy issues. Since then, Yang et al. [22] in 2015 and Wang et al. [23] and Chen et al. [24] in 2017 proposed solutions using Bayesian networks, [21,22,23,24] proposed solutions using probabilistic models, Cao et al. in 2012 [25] and 2013 [26] proposed solutions using behavioural and similarity analysis, Chen et al. [27] in 2013 and Liu et al. [15] in 2016 proposed modified perturbation mechanisms, Kumar et al. [17] in 2018 offered solutions using statistical correlation analysis, Lv et al. [5] in 2019 and Zhao et al. [15] proposed modifications to the DP algorithm, and the authors of [16, 28, 29] present the recent advancements in the area. Table 2 summarizes the notable works of previous researchers and highlights the limitations of their approaches.

Among the discussed works, [5] is the most relevant to ours. In [5], Lv et al. studied the data correlation within a dataset and then used it to assure data privacy through differential privacy. The main shortcomings of that paper were: (i) the use of the mutual information correlation (MIC) analysis technique to calculate the correlation among data, (ii) its inefficiency in handling voluminous and high-dimensional data, i.e., big data, and (iii) the very few parameters used to evaluate the solution. The presented work addresses these shortcomings. We used the distance correlation analysis method, as it can handle high-dimensional data; PySpark technology and the divide and conquer approach supported big data processing; and we used parameters such as data utility, mean average error, information loss, variation with data size, and privacy budget values to evaluate the proposed method.

Table 2 Proposed solutions

Basic principles and theories

Differential privacy definition

The foundation of the differential privacy mechanism is the concept of adjacent datasets. Consider two datasets, \(D_1\) and \(D_2\), which differ in exactly one record, denoted by \(\mid D_1 \Delta D_2 \mid = 1\); such datasets are termed adjacent. The conventional definition of differential privacy is as follows: let \(D_1\) and \(D_2\) be adjacent datasets and A a privacy mechanism with output range R. A satisfies \(\epsilon\)-differential privacy if, for every set of outputs S \(\subseteq\) R,

$$\begin{aligned} Pr[A(D_1) \in S] \le e^{\epsilon } Pr[A(D_2) \in S] \end{aligned}$$
(1)

Often, such privacy mechanisms are realized using global sensitivity (GS), the maximum change in the query output caused by the modification of a single record. Let \(D_1\) and \(D_2\) be adjacent datasets; the global sensitivity is defined as follows:

$$\begin{aligned} GS = max_{D_1,D_2} ||f(D_1)-f(D_2)||_1 \end{aligned}$$
(2)

where f is the query function, \(||.||_1\) is the L1-norm.

Differential privacy can then be realized by adding noise e to the query output:

$$\begin{aligned} A(D) = f(D)+ e \end{aligned}$$
(3)

Definition of the Laplace mechanism

The Laplace mechanism computes the function and perturbs each coordinate with noise drawn from the Laplace distribution [32]. The noise scale is adjusted to the sensitivity of the function divided by \(\epsilon\). The Laplace mechanism (LM) is used when the output is numerical [32].

Given the query function f and dataset D and the parameter \(\epsilon\), the Laplace mechanism A satisfies the following equation:

$$\begin{aligned} A(D) = f(D) + Laplace\left( \frac{GS}{\epsilon }\right) \end{aligned}$$
(4)

where \(Laplace\left( \frac{GS}{\epsilon }\right)\) denotes the Laplace distribution with scale \(\frac{GS}{\epsilon}\), and the parameter \(\epsilon\) is called the privacy budget. In practical applications, the noise e is usually generated by sampling through the inverse cumulative distribution function.
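For intuition, the standard one-line check (cf. [32]) that this mechanism satisfies Eq. (1): for any output s,

$$\begin{aligned} \frac{Pr[A(D_1)=s]}{Pr[A(D_2)=s]} = \frac{\exp \left( -\frac{\epsilon \mid s-f(D_1)\mid }{GS}\right) }{\exp \left( -\frac{\epsilon \mid s-f(D_2)\mid }{GS}\right) } \le \exp \left( \frac{\epsilon \mid f(D_1)-f(D_2)\mid }{GS}\right) \le e^{\epsilon } \end{aligned}$$

by the triangle inequality and the definition of GS in Eq. (2).

The following is a minimal Python sketch (our illustration, not code from [5] or [32]) of the Laplace mechanism of Eq. (4), generating the noise e by inverse-CDF sampling as described above; the function names are our own.

```python
import numpy as np

def laplace_noise(scale: float, size: int = 1, rng=None):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(-0.5, 0.5, size)      # uniform on (-1/2, 1/2)
    return -scale * np.sign(u) * np.log(1.0 - 2.0 * np.abs(u))

def laplace_mechanism(query_result: float, sensitivity: float, epsilon: float) -> float:
    """A(D) = f(D) + Laplace(GS / epsilon), i.e., Eq. (4)."""
    return query_result + laplace_noise(sensitivity / epsilon)[0]

# Example: a counting query has GS = 1; privatize a count with budget epsilon = 0.5.
print(laplace_mechanism(query_result=1234.0, sensitivity=1.0, epsilon=0.5))
```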

Distance correlation

In statistics, distance correlation is a measure of dependence between a pair of random vectors, which need not have equal dimensions. The coefficient ranges over [0,1], and it is zero if and only if the random vectors are independent. It can measure both linear and non-linear associations between the vectors, whereas the Pearson correlation captures only the linear association. [33] shows how distance correlation is estimated from given data samples. A practical view is that distance correlation can be computed from two matrices: one containing the pairwise distances between observations of X, the other the pairwise distances between observations of Y. X and Y are highly correlated if the entries of these matrices co-vary together; otherwise, the correlation value is meager.

Distance covariance The equation for Pearson covariance (Cov) between X and Y is given below:

$$\begin{aligned} Cov(x,y) = \frac{1}{n^2}\sum _{i=1}^{n}\sum _{j=1}^{n}\frac{1}{2}(x_i-x_j)(y_i-y_j) \end{aligned}$$
(5)

The terms \((x_i-x_j)\) and \((y_i -y_j)\) can be viewed as signed distances between the \(i^{th}\) and \(j^{th}\) samples in one dimension. Replacing them with doubly centered Euclidean distances \(D(x_i, x_j)\) yields the distance covariance:

$$\begin{aligned} DCov(x,y) = \frac{1}{n^2}\sum _{i=1}^{n}\sum _{j=1}^{n}D(x_i,x_j).D(y_i,y_j) \end{aligned}$$
(6)

Distance covariance has the following properties:

  I. If X and Y are independent, then DCov(X, Y) = 0.

  II. DCov(X, Y) \(\ge\) 0 and DCov(X, Y) \(\le\) 1.

  III. \(DCov^2(a_1 + b_1C_1 X, a_2+b_2C_2 Y) = \mid b_1 b_2\mid DCov^2(X, Y)\) for all scalars \(b_1, b_2\), constant vectors \(a_1, a_2\), and orthonormal matrices \(C_1, C_2\).

  IV. It is also defined for random vectors of different dimensions, since the distance between observations can be computed in any dimension.

  V. If the random vectors \((X_1, Y_1)\) and \((X_2, Y_2)\) are independent, then \(DCov(X_1 + X_2, Y_1 + Y_2) \le DCov(X_1, Y_1) + DCov(X_2, Y_2)\).

Distance variance It is the special case of distance covariance in which the two variables are the same:

$$\begin{aligned} DVar(x) = DCov(x,x) \end{aligned}$$
(7)

Properties of distance variance are:

  I. DVar(X) = 0 if and only if X = E[X], where E denotes the expectation.

  II. DVar(X) = 0 if and only if every observation is the same.

  III. DVar(a + bCX) = \(\mid b \mid\) DVar(X) for all constant orthonormal matrices C, constant vectors a, and scalars b.

  IV. DVar(X + Y) \(\le\) DVar(X) + DVar(Y) if X and Y are independent.

Distance correlation For two random variables, it is obtained by dividing their distance covariance by the product of their distance standard deviations:

$$\begin{aligned} DCor(x,y) = \frac{DCov(x,y)}{\sqrt{DVar(x)*DVar(y)}} \end{aligned}$$
(8)

Its properties are:

  I. 0 \(\le\) DCor(X, Y) \(\le\) 1 for all X and Y; this differs from the Pearson correlation, which can be negative.

  II. DCor(X, Y) = 0 if and only if X and Y are independent.

  III. It defines a statistical test for dependence via a permutation test.
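To make the sample estimators of Eqs. (5)–(8) concrete, here is a small NumPy/SciPy sketch (our illustration, not code from [33]) that computes the distance correlation of two samples from doubly centered pairwise distance matrices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def _centered_distances(x: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances with row, column, and grand means removed."""
    d = cdist(x, x)                                   # n x n distance matrix
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Sample DCor(x, y) per Eqs. (6)-(8); x is (n, p), y is (n, q)."""
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2_xy = max((a * b).mean(), 0.0)               # Eq. (6); guard tiny negatives
    dvar2_x, dvar2_y = (a * a).mean(), (b * b).mean() # Eq. (7)
    if dvar2_x * dvar2_y == 0.0:
        return 0.0
    return float(np.sqrt(dcov2_xy / np.sqrt(dvar2_x * dvar2_y)))  # Eq. (8)

# The two samples may have different dimensions (property IV above).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
print(distance_correlation(x, x ** 2))   # captures the non-linear association
```

Note that the estimator accepts samples of different dimensions, which is exactly what makes it suitable for the high-dimensional datasets targeted here.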

Information gain

Information gain is a popular metric for measuring data utility. It measures the amount of information that can be derived from a given dataset. We compute it via information loss [34]; its value ranges between 0 and 1. Information loss is computed using the following formula:

$$\begin{aligned} Information\, Loss\, in\, each\, field\, = \frac{|(Value_{original} - Value_{modified})|}{(Value_{original} + Value_{modified})} \end{aligned}$$
(9)

where \(Value_{original}\) is the original value of the attribute and \(Value_{modified}\) is the attribute’s value after applying the final methodology. Information Gain gets calculated by complementing the value of Information Loss [34].

$$\begin{aligned} Information\, Gain\, of\, each\, field\, = (1 - Information\,Loss) \end{aligned}$$
(10)
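A minimal Python sketch of Eqs. (9) and (10) (our helper names; it assumes positive numeric attribute values, as the denominator of Eq. (9) requires):

```python
import numpy as np

def information_loss(original: np.ndarray, modified: np.ndarray) -> np.ndarray:
    """Per-field information loss, Eq. (9): |orig - mod| / (orig + mod)."""
    return np.abs(original - modified) / (original + modified)

def information_gain(original: np.ndarray, modified: np.ndarray) -> np.ndarray:
    """Per-field information gain, Eq. (10): the complement of information loss."""
    return 1.0 - information_loss(original, modified)

# Example: utility retained on one numeric attribute after noise addition.
orig = np.array([120.0, 45.0, 300.0])
noisy = np.array([118.2, 47.1, 305.6])
print(information_gain(orig, noisy).mean())   # close to 1 => little utility lost
```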

Proposed methodology

As per the literature review and detailed research on the topic, one can extract the following observations:

  1. Data correlation must be studied, analyzed, and considered to meet the privacy needs of data. When left unnoticed, it can cause severe privacy leakages that degrade the privacy commitments of the applied data privacy algorithm.

  2. As the volume, size, and dimensionality of data increase, the correlation among data also increases: the number of correlated records grows, the magnitude of the correlation grows, or both. Thus, correlated big data has great potential for data privacy leakages.

  3. Implementing a differential privacy mechanism usually consumes substantial computing resources and privacy budget. This problem is further magnified when the data is big data.

  4. Increasing data correlation, size, and dimensionality raise the overall sensitivity of the data. Increased sensitivity results in larger noise being added by the differential privacy algorithm, which yields a highly polluted dataset of very little use after publication.

Based on the above observations, this paper proposes a novel mechanism to study data correlation using the distance correlation analysis method, handle big data efficiently using the divide and conquer approach, and finally apply differential privacy over the correlated big dataset.

Fig. 3: Step 1 of proposed methodology

Fig. 4: Step 2 of proposed methodology

Fig. 5: Step 4 of proposed methodology

Fig. 6: Step 5 of proposed methodology

The proposed mechanism can be summarised as a sequence of steps in the following manner:

  • Step 1—Calculation of Data Correlation—The standard approaches to studying data correlation are statistical methods, the mutual information correlation (MIC) method, and the distance correlation analysis method [5]. Statistical methods can only study the linear relationships among data. MIC can explore non-linear as well as linear relationships but is inefficient for high-dimensional data [4, 5]. Distance correlation analysis, however, efficiently handles linear, non-linear, and high-dimensional data. Thus, in the presented work, we used the distance correlation analysis method to generate the data correlation matrix instead of the MIC method used in [5].

  • Step 2—Division into smaller data blocks—The enormous size of big data leads to increased overheads. To overcome this, the proposed mechanism divides the data into smaller blocks using a combination of the k-means clustering algorithm and the distance correlation matrix generated in Step 1. Together they divide the large dataset D into multiple smaller data blocks \(D_1, D_2, \ldots , D_n\) such that \(D_1 \cup D_2 \cup \ldots \cup D_n = D\) and the blocks are pairwise disjoint, i.e., \(D_i \cap D_j = \emptyset\) for \(i \ne j\).

  • Step 3—Computing Correlation Sensitivity—This paper adopts the correlation sensitivity method proposed in [5], under which the sensitivity of a query f is calculated only over the block \(D_i\) instead of the entire dataset D. This reduces the noise added for distortion and helps improve data utility.

  • Step 4—Noise Addition—The noise is calculated from the correlation sensitivity (Step 3) using the Laplace mechanism. This step addresses observations 3 and 4 and is crucial, as the effectiveness of the differential privacy algorithm primarily depends on the sensitivity parameter. Reducing the magnitude of the added noise increases the data utility of the protected dataset. The calculated noise is added to the record value to generate a noisy value:

    $$\begin{aligned} \hbox {N}(\hbox {D}) = \hbox {f}(\hbox {D}) + \hbox {noise} \,(\hbox {e}) \end{aligned}$$

    where N(D) is the noisy data, f is the query executed over data D, and noise is the noise value calculated using correlation sensitivity and the Laplace mechanism.

  • Step 5—Implementing Differential Privacy—In our proposed work, we used the parallel composition property of differential privacy to apply the algorithm over the partitioned dataset, reducing the overheads associated with high-dimensional datasets. This step must be applied to all the formed data blocks to ensure the overall privacy protection of the big data.

Figures 3, 4, 5, and 6 give a schematic description of the steps of the proposed methodology.
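For concreteness, the following single-machine Python sketch strings Steps 1–5 together using NumPy and scikit-learn. It is a simplified stand-in of our own, not the paper's PySpark implementation: the role of the distance correlation matrix in the clustering is reduced to plain k-means over the records, and the block-level sensitivity is a placeholder for the correlation sensitivity of [5] (here, the sensitivity of a mean over a block whose attributes are scaled to [0, 1]). The distance_correlation helper is the one from the earlier sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def attribute_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """Step 1: pairwise distance-correlation matrix over attributes (columns)."""
    p = data.shape[1]
    corr = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            # distance_correlation as defined in the earlier sketch
            corr[i, j] = corr[j, i] = distance_correlation(data[:, [i]], data[:, [j]])
    return corr

def privatize_blocks(data: np.ndarray, query, r: int, epsilon: float, seed: int = 0):
    """Steps 2-5: split records into r blocks, privatize each block's query answer."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=r, n_init=10, random_state=seed).fit_predict(data)
    answers = []
    for k in range(r):                 # Step 5: parallel composition lets every
        block = data[labels == k]      # block spend the full privacy budget
        ans = query(block)
        # Step 3 (placeholder): sensitivity of a mean over this block alone,
        # assuming attribute values scaled to [0, 1].
        sensitivity = 1.0 / max(len(block), 1)
        noise = rng.laplace(0.0, sensitivity / epsilon, size=np.shape(ans))  # Step 4
        answers.append(ans + noise)
    return answers

# Example: privatized per-block attribute means with r = 4 blocks and epsilon = 1.0.
toy = np.random.default_rng(1).uniform(0, 1, size=(1000, 5))
print(privatize_blocks(toy, query=lambda d: d.mean(axis=0), r=4, epsilon=1.0))
```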

Algorithm 1 and Algorithm 2 (presented as figures in the original article) give the pseudocode of the adopted procedures.

Experiment and analysis

Experimental setup

The experimental platform is a laptop with an Intel® Core™ i5-8250U CPU @ 1.60 GHz (1.80 GHz boost), 8 GB RAM, and a 64-bit Windows 10 operating system on an x64-based processor. We used the MIC and distance correlation analysis methods to calculate the correlation between records and performed a comparative study; the adopted procedures are presented as Algorithms 1 and 2. We then implemented the methodologies and compiled the experiments and results using the TensorFlow environment on Google Colaboratory, with 13 GB RAM and a 108 GB disk. The following subsection gives a detailed description of the datasets used.

Table 3 Dataset description

Dataset description

In the present work, we utilized the following datasets, chosen for their adequate amount of data correlation and their privacy needs. Table 3 gives brief descriptions of the datasets, and Tables 4, 5, and 6 present their statistical features. In the experiments and observations, these datasets are referred to as Dataset a, Dataset b, and Dataset c.

  1. New York City trip data 2016: contains 19 attributes providing information about taxi trips in 2016.

  2. Chicago crime data: contains 22 attributes providing information about crime incidents, obtained from the City of Chicago data portal.

  3. New York City trip data 2013: contains 26 attributes providing information about taxi trips in 2013.

Table 4 Statistical features of the dataset a
Table 5 Statistical features of the dataset b
Table 6 Statistical features of the dataset c

Analysis and results

We analyzed the results of the proposed methodology using data correlation analysis, epsilon values, data size, mean average error, and data utility. For brevity, we refer to the methodology proposed in [5] as the cdp-method and to the method proposed in the current work as the d-method.

Data correlation

There are numerous methods to study the correlation among data. Our proposed methodology used distance correlation, while the method of [5] used mutual information correlation (MIC). Figures 7 and 8 present the correlation coefficients calculated using distance correlation and MIC for Dataset a; the same methodology was applied to Datasets b and c. One can observe that distance correlation analysis measures the data correlation better than MIC.
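As an illustration of the comparison, the snippet below contrasts the two coefficients on synthetic non-linearly dependent data. It assumes the third-party packages dcor (distance correlation) and minepy (MIC) are installed; neither is part of the standard scientific Python stack.

```python
# Assumes `pip install dcor minepy` (third-party packages; an assumption about the setup).
import numpy as np
import dcor
from minepy import MINE

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 5000)
y = x ** 2 + rng.normal(0, 0.05, 5000)    # strong non-linear dependence

dc = dcor.distance_correlation(x, y)      # distance correlation coefficient

mine = MINE(alpha=0.6, c=15)              # MINE estimator with default settings
mine.compute_score(x, y)

print(f"distance correlation = {dc:.3f}, MIC = {mine.mic():.3f}")
```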

Epsilon values

Epsilon, also known as the privacy budget, is an important parameter that governs differential privacy protection. The lower the value of epsilon, the higher the level of privacy protection and the lower the utility of the data: higher perturbation results in a greater loss of original data values but provides stronger privacy protection. The observations from the experimental analysis suggest that the proposed d-method provides nearly equal privacy performance to the cdp-method [5]. Figure 11 depicts this, comparing the privacy performance of both methods on the three datasets.
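The trade-off can be reproduced in a few lines: the sketch below (our illustration) measures the empirical MAE of a Laplace-privatized mean at several privacy budgets, using the fact that the sensitivity of a mean over n records bounded in [0, 1] is 1/n.

```python
import numpy as np

def mae_for_budgets(values: np.ndarray, epsilons, trials: int = 500, seed: int = 0):
    """Empirical MAE of a Laplace-privatized mean at several privacy budgets."""
    rng = np.random.default_rng(seed)
    true_mean, sens = values.mean(), 1.0 / len(values)   # sensitivity of a mean
    for eps in epsilons:
        noisy = true_mean + rng.laplace(0.0, sens / eps, size=trials)
        print(f"epsilon={eps:4.1f}  MAE={np.abs(noisy - true_mean).mean():.6f}")

# Smaller epsilon -> larger noise -> larger MAE (lower utility), stronger privacy.
mae_for_budgets(np.random.default_rng(1).uniform(0, 1, 10_000), [0.1, 0.5, 1.0, 2.0])
```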

Datasize

As per the traditional assumptions of differential privacy, privacy protection levels do not vary with the amount of data; this is considered one of the advantages of differential privacy. The comparative analysis of the cdp-method [5] and the proposed d-method shows that increasing data size has a meager impact on the privacy performance of both methods, whose curves are nearly identical, as shown in Fig. 12.

Number of clusters

Figure 9 shows the variation of r-MAE with epsilon for different values of r, i.e., the number of data subsets formed from the large datasets. Observations show that a small value of r results in a larger r-MAE and hence poorer privacy protection performance. After a threshold value of r, however, r-MAE becomes stable and optimal privacy protection performance is obtained. This result is similar to that of [5]; hence, with respect to variation in r, the proposed method is as efficient as the cdp-method [5]. The r-MAE is defined as:

$$\begin{aligned} r\text{-}MAE = \frac{1}{r}\sum _{i=1}^{r} MAE_{i,\epsilon } \end{aligned}$$
(11)
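Computationally, Eq. (11) is just the average of the per-block MAE values at a fixed epsilon; a trivial helper (ours) with illustrative numbers:

```python
import numpy as np

def r_mae(block_maes: np.ndarray) -> float:
    """Eq. (11): mean of the per-block MAE values measured at a fixed epsilon."""
    return float(np.mean(block_maes))

# Example: MAE observed on r = 5 blocks at one epsilon value (illustrative numbers).
print(r_mae(np.array([0.12, 0.09, 0.15, 0.11, 0.10])))   # -> 0.114
```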

Data utility

Data utility and data privacy have an inherent trade-off: an incline in data utility implies a decline in data privacy and vice versa, which makes measuring data utility very important. Various metrics measure data utility and, through it, the data privacy level; we used information gain, where more information gained means higher data utility. Table 7 states the information gain values of different clusters under the cdp-method [5] and the d-method, and Fig. 10 presents the same data graphically. Comparing the two, we observed that the d-method attains greater information gain for all clusters, implying better data utility. Both methods were then compared against the conventional differential privacy algorithm: conventional DP provides the least data utility, followed by the cdp-method [5], while the d-method offers the highest. Table 8 depicts these values.

Table 7 Information gain values for different clusters using cdp-method [5] and the proposed d-method
Table 8 Information gain values for conventional DP, cdp-method [5] and d-method
Fig. 7: Distance correlation matrix

Fig. 8: MIC correlation matrix

Fig. 9: Epsilon versus r-MAE trend for the different datasets

Fig. 10: Information gain values for the cdp-method [5] versus the proposed d-method

Fig. 11: Epsilon versus MAE trend for the different datasets

Fig. 12: Datasize versus MAE trend for the different datasets

Conclusion

The experimental analysis showed satisfactory results for the proposed mechanism. This paper first studied how the existence of correlation among data can adversely affect the privacy guarantees of any privacy algorithm, through an extensive literature survey and experimental analyses. A noticeable difference was observed when data were clustered (i) on a general basis and (ii) based on the existing correlation: the clusters formed were very different in the two cases. This further strengthens the notion that correlation dramatically impacts how data is interpreted and must be considered by privacy mechanisms. The proposed mechanism used the distance correlation analysis technique to study the correlation among data in real-world datasets; this technique was selected because it can handle high-dimensional, linear, and non-linear data, and most real-world datasets are high-dimensional and non-linear by nature. To address the large size of real-world datasets, the proposed mechanism adopted the divide and conquer approach of the cdp-method, combined with the parallel composition of the differential privacy protection mechanism and PySpark technology, to ensure the privacy of the data. Results showed that the distance correlation analysis method is better than the MIC method in both correlation analysis and data utility. The other results were similar to those of the cdp-method, which shows that the proposed methodology provides better data utility while maintaining the data privacy levels offered by the cdp-method. These observations can be summarized as follows:

  1. This paper studied the adverse effect of data correlation in real-world datasets.

  2. We used the distance correlation analysis technique to study the correlation among data; unlike the traditional MIC analysis method, it can efficiently handle high-dimensional data.

  3. A divide and conquer methodology handles the big dataset, and the parallel composition of the differential privacy mechanism ensures privacy protection for the entire dataset. This is similar to the cdp-method.

  4. Experimental results showed that distance correlation offered the required privacy protection of correlated data and maximum data utility:

$$\begin{aligned} DataUtility(\text{conventional DP})< DataUtility(\text{cdp-method})< DataUtility(\text{d-method}) \end{aligned}$$

  5. The proposed work utilized the big data framework PySpark to handle big data efficiently; the researchers of the cdp-method used no such technology.

Availability of data and materials

All relevant research data and materials are available with the authors.

References

1. Liang JY, Feng CJ, Song P. A survey on correlation analysis of big data. J Soft. 2016;01(39):1–18.

2. Reshef D, Reshef Y, Finucane H, Grossman S, McVean G, Turnbaugh P, et al. Detecting novel associations in large data sets. Science (New York, NY). 2011;12(334):1518–24.

3. Abdalla HB. A brief survey on big data: technologies, terminologies and data-intensive applications. J Big Data. 2022;11:9.

4. Biswas S, Khare N, Agrawal P, Jain P. Machine learning concepts for correlated Big Data privacy. J Big Data. 2021;12:1.

5. Lv D, Zhu S. Achieving correlated differential privacy of big data publication. Comput Secur. 2019;05:82.

6. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. L-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data. 2007;1(1):3-es.

7. Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd international conference on data engineering; 2007. p. 106–15.

8. Dwork C. Differential privacy. In: 33rd international colloquium on automata, languages and programming, part II (ICALP 2006). Vol. 4052 of Lecture notes in computer science. New York: Springer; 2006. p. 1–12. https://www.microsoft.com/en-us/research/publication/differential-privacy/.

9. Yang X, Wang T, Ren X, Yu W. Survey on improving data utility in differentially private sequential data publishing. IEEE Trans Big Data. 2017;1:1.

10. Jain P, Gyanchandani M, Khare N. Big data privacy: a technological perspective and review. J Big Data. 2016;11:3.

11. Jain P, Gyanchandani M, Khare N. Differential privacy: its technological prescriptive using big data. J Big Data. 2018;04:5.

12. Zhu T, Xiong P, Li G, Zhou W. Correlated differential privacy: hiding information in non-IID data set. IEEE Trans Inf Forensics Secur. 2015;10(2):229–42.

13. Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. SIGMOD'11. New York: Association for Computing Machinery; 2011. p. 193–204. https://doi.org/10.1145/1989323.1989345.

14. Wu G, Xia X, He Y. Extending differential privacy for treating dependent records via information theory; 2017.

15. Zhao J, Zhang J, Poor HV. Dependent differential privacy for correlated data; 2017. p. 1–7.

16. Li Y, Ren X, Yang S, Yang X. Impact of prior knowledge and data correlation on privacy leakage: a unified analysis. IEEE Trans Inf Forensics Secur. 2019;14(9):2342–57.

17. Kumar S, Chong I. Correlation analysis to identify the effective data in machine learning: prediction of depressive disorder and emotion states. Int J Environ Res Public Health. 2018;15(12):1. https://www.mdpi.com/1660-4601/15/12/2907.

18. Yang X, Wang T, Ren X, Yu W. Survey on improving data utility in differentially private sequential data publishing. IEEE Trans Big Data. 2017;1:1.

19. Gehrke J, Lui E, Pass R. Towards privacy for social networks: a zero-knowledge based definition of privacy. In: Ishai Y, editor. Theory of cryptography. Berlin: Springer; 2011. p. 432–49.

20. Belcastro L, Cantini R, Marozzo F, et al. Programming big data analysis: principles and solutions. J Big Data. 2022;01:9.

21. Kifer D, Machanavajjhala A. Pufferfish: a framework for mathematical privacy definitions. ACM Trans Database Syst (TODS). 2014;01:39.

22. Yang B, Sato I, Nakagawa H. Bayesian differential privacy on correlated data; 2015.

23. Wang Y, Song S, Chaudhuri K. Privacy-preserving analysis of correlated data; 2016. arXiv:1603.03977.

24. Chen J, Ma H, Zhao D, Liu L. Correlated differential privacy protection for mobile crowdsensing. IEEE Trans Big Data. 2017;12(PP):1.

25. Cao L, Ou Y, Yu P. Coupled behavior analysis with applications. IEEE Trans Knowl Data Eng. 2012;08(24):1.

26. Cao L. Non-IIDness learning in behavioral and social data. Comput J. 2013;08(57):1358–70.

27. Chen R, Fung B, Yu P, Desai B. Correlated network data publication via differential privacy. VLDB J. 2014;08(23):653–76.

28. Hemkumar D, Ravichandra S, Somayajulu DVLN. Impact of data correlation on privacy budget allocation in continuous publication of location statistics. Peer-to-Peer Netw Appl. 2021;14(3):1650–65.

29. Chen J, Ma H, Zhao D, Liu L. Correlated differential privacy protection for mobile crowdsensing. IEEE Trans Big Data. 2021;7(04):1.

30. Cerruto F, Cirillo S, Desiato D, et al. Social network data analysis to highlight privacy threats in sharing data. J Big Data. 2022;02:9.

31. Song Y, Cao L, Wu X, Wei G, Ye W, Ding W. Coupled behavior analysis for capturing coupling relationships in group-based market manipulations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining; 2012.

32. Dwork C, Roth A. The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci. 2014;9(3–4):211–407. https://doi.org/10.1561/0400000042.

33. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94. http://www.jstor.org/stable/25464608.

34. Jain P, Gyanchandani M, Khare N. Enhanced secured map reduce layer for big data privacy and security. J Big Data. 2019;03:6.


Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

SB conducted the literature review process, proposed the novel approach, wrote the manuscript, extracted the results of the experiments and arranged them in a presentable manner. AF helped with the presentation of the manuscript, performing the experiments and preparation of the figures. NK has helped with his technical expertise throughout the research work. He has been a constant guide from the beginning to the end. PA has helped with the final formatting of the paper and preparation of several tables. All the authors reviewed the manuscript.

Authors’ information

Sreemoyee Biswas is currently pursuing a Ph.D. in Computer Science and Engineering at Maulana Azad National Institute of Technology, Bhopal, India. Her field of research is Big Data Privacy; other areas of specialization include Data Privacy, Information Security, and Machine Learning. She has about two years of experience as an Assistant Professor and holds an M.Tech and a B.E. in Computer Science and Engineering. Ms. Biswas has publications in SCI-indexed journals, Scopus-indexed journals, and national conferences.

Anuja Fole has completed her M.Tech in Advanced Computing (Computer Science) at Maulana Azad National Institute of Technology and is currently working as a Data Scientist. Her areas of specialization are Machine Learning, Big Data, and Data Privacy, and her current research interest is Big Data. She has two years of industry experience as an Associate Software Engineer and holds a B.E. in Information Technology.

Nilay Khare is working as a Professor at MANIT Bhopal, with more than 21 years of experience. He holds a Ph.D. in Computer Science and Engineering. Dr. Khare's areas of specialization are Big Data, Big Data Privacy and Security, Wireless Networks, and Theoretical Computer Science. He has publications in 54 international and national conferences and international journals and is a Life Member of ISTE.

Pragati Agrawal is working as an Assistant Professor at MANIT Bhopal, with more than five years of experience. She holds a Ph.D. in Computer Science and Engineering. Dr. Agrawal's areas of specialization are Theoretical Computer Science and Energy Efficiency. Her publications are in international and national conferences and international journals, and she is a Life Member of IEEE and ACM.

Corresponding author

Correspondence to Sreemoyee Biswas.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have given consent for publication of the matter.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Cite this article

Biswas, S., Fole, A., Khare, N. et al. Enhancing correlated big data privacy using differential privacy and machine learning. J Big Data 10, 30 (2023). https://doi.org/10.1186/s40537-023-00705-8
