 Survey Paper
Machine learning concepts for correlated Big Data privacy
Journal of Big Data volume 8, Article number: 157 (2021)
Abstract
With data becoming a salient asset worldwide, dependence among data has kept growing. Hence, the real-world datasets that one works with today are highly correlated. Over the past few years, researchers have turned their attention to this aspect of data privacy and have found correlation among data. Existing data privacy algorithms cannot deliver the privacy guarantees expected of them on such data: those guarantees were sufficient only when no relation existed between the data in a dataset. Hence, taking the existence of data correlation into account, there is a dire need to reconsider the privacy algorithms. Some of the research has utilized a well-known machine learning concept, i.e., Data Correlation Analysis, to better understand the relationship between data. This concept has given some promising results as well. Though the body of work is still small, researchers have done a considerable amount of research on correlated data privacy. They have provided solutions using probabilistic models, behavioral analysis, sensitivity analysis, information theory models, statistical correlation analysis, exhaustive combination analysis, temporal privacy leakages, and weighted hierarchical graphs. Nevertheless, researchers work with real-world datasets that are often large (technically termed big data) and contain a high degree of data correlation. Firstly, the data correlation in big data must be studied. Researchers are exploring different analysis techniques to find the most suitable one. Then, they might suggest a measure to guarantee privacy for correlated big data. This survey paper presents a detailed survey of the methods proposed by different researchers to deal with the problem of correlated data privacy and correlated big data privacy and highlights the future scope in this area. The quantitative analysis of the reviewed articles suggests that data correlation is a significant threat to data privacy. This threat is further magnified with big data. When data correlation is considered and analyzed, parameters such as maximum queries executed and mean average error values show better results than with other methods. Hence, there is a grave need to understand and propose solutions for correlated big data privacy.
Introduction
Data privacy is the appropriate use of data available with any individual or organization, unlike data security, which guarantees confidentiality, integrity, and availability of data. Some of the well-known data privacy preservation methods are k-anonymization [1, 2], l-diversity [2, 3], t-closeness [4], and differential privacy [5]. Table 1 summarizes these algorithms along with their pros and cons. Among these, k-anonymization, l-diversity, and t-closeness fall under the category of de-identification algorithms. k-anonymization [1, 2] works on the principle of making each record identical to at least k-1 other records over a set of attributes called quasi-identifiers. It protects against linkage attacks by making k records indistinguishable from each other. The larger the value of k, the higher the privacy and, consequently, the lower the data utility. Even with a larger value of k, there may be cases where the sensitive data in an equivalence class do not exhibit diversity. In that case, the dataset becomes vulnerable to homogeneity attacks. To prevent this, l-diversity [2, 3] ensures that each equivalence class contains l well-represented values for each sensitive attribute. t-closeness [4] is an improved version of l-diversity that ensures that the distance between the distribution of a sensitive attribute in the equivalence class and the distribution of that attribute in the whole table is not more than a threshold value t. These methods can prevent distribution attacks on datasets to a large extent. Data analysts and researchers have used another mechanism, i.e., differential privacy (DP) [5], to obtain helpful information from a database while ensuring that privacy is maintained when publishing it. The basic principle behind it is to introduce distortion in the database before publishing. Nevertheless, the distortion should be large enough to provide high privacy and, at the same time, small enough to ensure data utility after data publishing [6, 7]. Hence, to calculate the optimum amount of distortion, researchers apply many metrics. Traditional DP used global sensitivity [5]. Later, researchers suggested other methods in order to provide privacy along with data utility [8,9,10]. Table 1 gives a brief description of the above-mentioned mechanisms. Owing to the widespread use of DP, rigorous research has been carried out on it. Along with some other drawbacks, researchers identified a vital drawback that could threaten data privacy: the existence of correlation in the datasets [9, 11, 12].
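The de-identification criteria described earlier in this section can be checked mechanically on a small table. The following is a minimal sketch, assuming a table represented as a list of dictionaries and using the simplest reading of l-diversity (at least l distinct sensitive values per equivalence class). The function names, the column names, and the toy records are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        classes[key].append(rec)
    return classes

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class holds at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every class has at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(
        len({rec[sensitive] for rec in group}) >= l
        for group in classes.values()
    )

# Hypothetical generalized table: age range and ZIP prefix are the
# quasi-identifiers, disease is the sensitive attribute.
table = [
    {"age": "20-30", "zip": "130**", "disease": "flu"},
    {"age": "20-30", "zip": "130**", "disease": "flu"},
    {"age": "30-40", "zip": "148**", "disease": "cancer"},
    {"age": "30-40", "zip": "148**", "disease": "flu"},
]
qi = ["age", "zip"]

print(is_k_anonymous(table, qi, k=2))            # True
print(is_l_diverse(table, qi, "disease", l=2))   # False
```

The table is 2-anonymous, yet its first equivalence class contains only one disease value, which is the homogeneity weakness that l-diversity is meant to address.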
Initial research on DP ignored the existence of correlation and regarded data as independent and identically distributed (IID). However, later research showed that real datasets often exhibit high correlation, and hence research that accounts for data correlation became highly significant [10, 13, 14]. Many researchers have studied the adverse effects of data correlation on data privacy and, in their published works, have proposed various approaches to deal with the privacy threats it poses. For this, many researchers have used a well-known machine learning technique termed Data Correlation [15]. Along with ensuring data privacy in the presence of data correlation, another big concern is maintaining data utility, for which various algorithms have also been proposed.
The above-stated problem of data correlation as a data privacy threat becomes more severe when the data is of large volume and involves more complexity and higher dimensionality [16]. The technical term for such data is “Big Data.” The concern therefore grows with the presence of big data in most fields. Owing to exhaustive data generation, big data is present almost everywhere, and as general data approaches big data, the magnitude of data correlation within datasets also increases. Hence, the need to work towards the privacy of correlated big data increases manifold. In a later section, this paper also describes the structure and organization of big data, which is fundamental to big data privacy and correlated big data privacy.
This work outlines all the related works where data correlation has threatened data privacy and big data privacy. This work will be helpful to all those who wish to explore the same area further as it provides a detailed study of the critical points, the proposed methodology, the conclusions, and the limitations of the related works.
Machine learning for data privacy
Machine learning has multidimensional capacity and finds applications in varied fields such as speech recognition, online fraud detection, image recognition, product recognition, etc. [17]. Sensor machine learning is a comparatively new application of machine learning that uses various machine learning algorithms to work with sensor data. It is of great industrial importance as well. Using accurate sensor data and machine learning algorithms, the failure of heavy machines can be predicted well ahead of time to minimize loss [18]. Also, several research papers have suggested algorithms to facilitate machine learning on sensor data [19, 20]. Another emerging application of machine learning is big data privacy. A large amount of data correlation in big data poses a threat to its privacy. Researchers can strengthen data privacy guarantees by utilizing machine learning concepts. Using one such concept, data correlation analysis, they can analyze the relationship among data and later prevent it from being a threat. For example, [13] uses MIC, one of the existing data correlation analysis techniques, for this purpose. Similarly, other ML algorithms can also be analyzed and applied. The works discussed in [21, 22] have explored various correlation analysis techniques.
Fundamentals of Big Data
Big data is defined as “a new generation of technologies and architectures built to economically separate value from massive volumes of a wide variety of data by allowing high-velocity collection, discovery, and analysis” [23]. According to this definition, the 3Vs reflect the properties of big data: volume, velocity, and variety. Later research has shown that the 3Vs definition is inadequate to describe the current state of big data. Additional characteristics, namely veracity, validity, value, variability, venue, vocabulary, and vagueness, were therefore added as complementary descriptions of big data. A common theme is that big data can include text, audio, image, video, and other types of data. Variety represents the various types of data. A big data architecture, as described in Fig. 1, handles the processing and analysis of data that is too large or complex for traditional database systems. Batch processing of big data sources at rest, real-time processing of big data in motion, interactive exploration of big data, predictive analytics, and machine learning are everyday tasks in big data management.
Various frameworks have been established in recent years to ensure big data privacy. Given the massive amount of data and the combination of structured and unstructured data, new big data models are needed to improve privacy and protection. One such model builds on existing privacy-preserving data techniques and incorporates a new Enhanced Secured Map Reduce (ESMR) layer [24] of privacy in the MapReduce phase of the big data architecture. As the data passes through the MapReduce process, this new layer applies the protection algorithms to each individual piece of data. Data processed through the proposed ESMR layer can thus be kept safe and secure.
In today’s digital world, where large amounts of information are stored as big data, the analysis of databases offers opportunities to address major societal problems in areas such as healthcare, sensors, satellites, the share market, elections, crop prediction, and others. Another field that has emerged as a major application of big data is remote sensing. Remote sensing holds large scope in newer applications such as developing strategies for reducing resource consumption, timely disaster forecasting, monitoring global changes, etc. [25]. Some conventional remote sensing application areas are image classification [26], crop classification [27], land cover and land use classification [28], satellite image classification [28], and many more. Research papers [25, 29,30,31] present a detailed review of the need for and use of machine learning in remote sensing. The authors of [25] also used examples of commercial players that use remote sensing for Earth observation to advocate the need for machine learning in remote sensing. Remote sensing has a plethora of applications in varied fields, and applying machine learning to it can be a great boon.
Big Data is often challenged with many privacy issues [32]. Its application in varied fields calls for stringent privacy measures [33]. One of the recent privacy threats is the presence of data correlation in big datasets, which this paper discusses in detail.
Organisation of the paper
This paper presents a detailed review of research works from 2011 to 2019 that showed that existing algorithms are not sufficient to deal with the problem of correlated data and proposed algorithms to resolve it. Following the introduction, the next section briefly describes the basics of big data, which is fundamental to the topic of this paper. The literature review section then summarizes the research papers and highlights the importance of each. The section after that discusses the research papers that pioneered the understanding of correlated data and classified it as a threat to data privacy. The subsequent section discusses in detail the research papers that have tried to provide a solution to the stated problem; it is divided into several subsections according to the domain of the proposed methodology. The next section presents the results of experiments conducted to compare the various proposed methods on a few selected parameters. The last section states the conclusion and describes the open research areas in detail. Table 2 lists the abbreviations used in this work.
Literature review
Initially, when researchers considered data privacy, they inherently assumed that no relations exist among data. Furthermore, all the proposed data privacy mechanisms implicitly assumed that they would operate on independent datasets. Gradually, researchers and scholars around the globe started noticing that an independent dataset is very unlikely in a real-life scenario. Most of the datasets they worked with exhibited data correlation, and taking it into account was the realistic way to understand the data and the behavior of the proposed mechanisms. Over the past few years, researchers have been analyzing the effects of data correlation on data privacy. Some of them have also proposed algorithms that modify the existing privacy techniques to secure privacy for correlated data.
Gehrke et al. [34] and Kifer et al. [11], both in 2011, gave initial arguments that considering the correlation between records is pivotal, as correlation between records or attributes can substantially decrease the privacy guarantee provided by any algorithm. These were the earliest works that recognized the existence of data correlation within datasets and identified it as a threat to data privacy. They were pioneers in this area but severely lacked practical application. The works [35,36,37,38] deal with a special case of correlated data, i.e., behavior analysis of the stock market, because the data of various investors and traders are often not mutually independent. Behavioral coupling between investors and traders, and among investors, is frequently observed.
Cao et al. [35] in 2012 used the example of a stock market to study Coupled Behaviour Analysis and were the first to do research in this area. This work also discussed multivariate time series, sequence analysis, and the Coupled Hidden Markov Model (CHMM) to study abnormal behavior in databases such as a trading database. The proposed algorithms were tested on a real dataset of the Asian Stock Exchange from 1st June 2004 to 31st December 2005. Longbing Cao [39] in 2013 presented another study on the non-IIDness of data, which analyzed the IID assumptions of several classical algorithms and showed how they might not work correctly on non-IID data. To measure non-IIDness, the author used a measure termed Coupled Item Similarity (CIS). This work suggested a method to study the dependency among data but did not suggest any practical implementation to ensure data privacy.
Kifer et al. [40], in their subsequent work in 2014, proposed a privacy mechanism called the Pufferfish mechanism, which other researchers have used as a part of their own work on privacy. However, the proposed framework did not satisfy Differential Privacy, and hence privacy preservation for correlated data still called for further investigation. Yang et al. [41] in 2015 proposed another mechanism that used Bayesian principles to help Differential Privacy cope with correlated data. The proposed mechanism was very efficient, but it had the same disadvantages as a Bayesian model. Wang et al. [6] in 2016 used the Pufferfish Mechanism (PM) along with the Wasserstein distance to ensure privacy for correlated data, with the help of two examples: physical activity monitoring and flu status. They proposed that the added noise must be proportional to the Wasserstein distance. The authors concluded that the proposed mechanism performs satisfactorily for a certain range of parameter values and less well for the remaining range. Chen et al. [7] in 2017 proposed a perturbation mechanism for Mobile Crowd Sensing (MCS) data using a Bayesian network that tries to add noise while increasing the utility of the published data. The proposed mechanism requires full knowledge of the probabilistic relationship among records, which is not always practically feasible.
Chen et al. [8] in 2013, with the help of social network data, proposed multiplying the global sensitivity by the number of correlated records to deal with the issue. The major drawback of the proposed method is high data utility loss. Zhu et al. [9] in 2015 showed how the solution proposed in [8] induced an enormous magnitude of noise, which significantly damaged the utility of the data. They also proposed a different sensitivity analysis method called Correlated Sensitivity, which reduced the amount of noise compared to global sensitivity. A description of correlated sensitivity is given in the later sections of this paper. Liu et al. [10] in 2016 proposed a perturbation mechanism for dependent data called Dependent Differential Privacy (DDP). They performed an inference attack on a real-world dataset to demonstrate how an adversary can infer a user’s sensitive information by using the noise-added query outputs and exploiting the user’s social relationships, thus violating the Differential Privacy guarantees. The proposed perturbation mechanism adds an extra parameter called the dependence coefficient, \(\rho _{ij}\), to measure the dependence relationship between tuples at a fine granularity. The key challenge of the proposed approach is to compute the value of \(\rho _{ij}\) accurately, as it relies on probabilistic models of statistical data.
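The utility loss attributed to the approach of [8] can be illustrated with a short sketch. It assumes a counting query with global sensitivity 1 and the standard Laplace mechanism, whose noise scale is the sensitivity divided by the privacy budget \(\epsilon\). The function names are hypothetical, and the code is an illustration of the scaling idea only, not the implementation used in [8] or [9].

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noise_scale(global_sensitivity, epsilon, correlated_records=1):
    """Laplace scale b = (k * GS) / epsilon; k = 1 is the independent case."""
    return correlated_records * global_sensitivity / epsilon

def noisy_count(true_count, epsilon, correlated_records=1, rng=None):
    """Release a count with Laplace noise; a counting query has GS = 1."""
    rng = rng or random.Random()
    scale = noise_scale(1.0, epsilon, correlated_records)
    return true_count + laplace_noise(scale, rng)

epsilon = 1.0
print(noise_scale(1.0, epsilon))                          # independent records
print(noise_scale(1.0, epsilon, correlated_records=10))   # 10 correlated records

rng = random.Random(42)  # fixed seed so the illustration is repeatable
print(noisy_count(100, epsilon, correlated_records=1, rng=rng))
print(noisy_count(100, epsilon, correlated_records=10, rng=rng))
```

Because the expected absolute Laplace noise equals the scale, treating 10 records as fully correlated makes the expected error ten times larger at the same privacy budget, which is the kind of utility damage that correlated sensitivity aims to reduce.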
Wu et al. [12] in 2017 proposed using the foundations of Information Theory to study the relationship between Information Theory principles and Differential Privacy. The authors concluded that concepts of Information Theory are well suited to model the problems of dependent data in Differential Privacy. This work did not study whether using concepts of Information Theory induces any other privacy leakages. Such leakages would need to be studied before the approach can be applied in practice.
Kumar et al. [42] in 2018 used Pearson’s product-moment correlation method to study the relationship between data. The datasets used were a depressive disorder symptom dataset for evaluating depression severity, a local weather dataset for classifying depression severity, and a physiological sensor dataset for emotion detection. However, a better correlation analysis technique is needed, as Pearson’s product-moment correlation cannot correctly capture relationships among independent variables or other complex relationships.
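The limitation of Pearson’s coefficient noted above can be seen on a toy example. The sketch below, with a hypothetical helper `pearson` and synthetic data that are not taken from [42], computes the coefficient for a linear and for a purely quadratic relationship over a symmetric range.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = list(range(-50, 51))            # symmetric around zero
linear = [2 * x + 1 for x in xs]     # y depends linearly on x
quadratic = [x * x for x in xs]      # y is fully determined by x, but not linearly

print(pearson(xs, linear))       # close to 1
print(pearson(xs, quadratic))    # close to 0 despite full dependence
```

The quadratic series is completely determined by `xs`, yet its coefficient is essentially zero, so an analysis based only on Pearson’s coefficient would treat the two attributes as unrelated.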
Lv et al. [13] in 2019 studied data correlation in a correlated big dataset to ensure its data privacy. They used the MIC (Mutual information correlation) algorithm to analyze data correlation based on Information Theory and Mutual Information Theory. The authors also proposed two Correlated Differential Privacy models for big data privacy protection: (1) k-Correlated Record Differential Privacy and (2) r-Correlated Block Differential Privacy. For the experimental analysis, they chose the National Air Quality Data, as time-series data are usually highly correlated. For evaluation, the proposed mechanism was compared with the mechanisms proposed in [8, 9] and [10] under suitably chosen parameter values. Nevertheless, the proposed approach has a few shortcomings, which leave open scope for further research in this field.
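The correlation analysis in [13] is described above as resting on mutual information. The sketch below is not the MIC algorithm; it only estimates the plain mutual information \(I(X;Y)\), in bits, between two discrete variables from their observed frequencies. The function name and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_joint * math.log2(p_joint / (p_x * p_y))
    return mi

a = [0, 0, 1, 1]
print(mutual_information(a, [0, 0, 1, 1]))   # identical attributes: 1 bit
print(mutual_information(a, [1, 1, 0, 0]))   # inverted copy: still 1 bit
print(mutual_information(a, [0, 1, 0, 1]))   # empirically independent: 0 bits
```

Unlike a linear coefficient, mutual information assigns the same value to a copy and to an inverted copy of an attribute, because it measures how much one attribute reveals about the other rather than the direction of the relationship.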
Zhao et al. [10] in 2019 considered all feasible combinations of tuple correlations and query responses to provide a slight modification to differential privacy that addresses the above-stated problem.
Cao et al. [43] in the same year took up temporal data to further understand the problem. The main objectives of this work were to analyze and quantify temporal privacy leakage. The researchers used the concepts of backward privacy leakage (BPL) and forward privacy leakage (FPL) for this purpose. The future scope of this work lies in studying privacy leaks under temporal correlations combined with other types of correlation models.
Hemkumar et al. [45] in 2021 followed the same lines and studied temporal correlation. They proposed w-event privacy to deal with the problem. The proposed methodology does not study the correlation between other values.
Li et al. [14], later in the same year, used small examples to explain cases of positive correlation, negative correlation, and no correlation, along with how an adversary can exploit them with little or full prior knowledge. Mechanisms proposed in [46,47,48] and [41], based on the Bayesian inference method, can only be applied in the case of positive correlation, unlike the proposed Prior Differential Privacy (PDP) mechanism. The authors initially used the Weighted Hierarchical Graph technique as a solution, but later pointed out its disadvantages and applied a multivariate Gaussian model. The paper further describes the observations and the results derived. Nevertheless, the proposed analysis was applied only to linear queries.
Further in the same year, in the work [49], the proposed perturbation mechanism is based on the minimization of Laplace noise using the Bayesian network model. Subsequently, the authors Chen et al. conducted simulations to evaluate the proposed algorithms and demonstrated their effectiveness. They also proved the influence of the maximum correlated group on the Bayesian differential privacy mechanism by using the Gaussian correlation model.
Figure 2 illustrates the categories of the proposed solutions, grouped by the technique used. Table 3 summarizes the proposed methodologies and their pros and cons in tabular form for easier understanding.
Realization of correlated data as a threat
Gehrke et al. [34] in 2011 considered the case of social networks, where users and their data are highly correlated and even the strong privacy guarantee provided by Differential Privacy could not assure privacy. It was not a direct analysis of correlated data and its effects, but an indirect study of the effects of data correlation. This work also emphasized achieving a zero-knowledge definition of privacy along the lines of Dalenius: an adversary must gain zero additional knowledge by accessing the proposed mechanism. The authors also formalized the notion of “Aggregated Information,” which would be acceptable to release to attain the desired utility level without compromising privacy. The proposed Zero-Knowledge Privacy mechanism makes use of the Aggregated Information parameter. In the proposed work, mechanisms, adversaries, and simulators are simply randomized algorithms that play certain roles in the definitions. Let San be a mechanism that operates on databases, and let \(D\) be the class of all databases. For any database \(D_{1} \in D\), any adversary A, and any \(z \in \{0, 1 \}^{*}\), let \(Out_{A}(A(z) \Leftrightarrow San(D_{1}))\) denote the random variable representing the output of A on input z after interacting with the mechanism San operating on the database \(D_{1}\). Note that San can be interactive or non-interactive. If San is non-interactive, then \(San(D_{1})\) sends information (e.g., a sanitized database) to A and then halts immediately; the adversary A then tries to breach the privacy of some individual in the database \(D_{1}\). Let agg be any class of randomized algorithms that provide aggregate information to simulators.
Definition 1
Zero-Knowledge Privacy with aggregate information. We say that San is \(\epsilon\)-zero-knowledge private with respect to agg if there exists a \(T \in agg\) such that for every adversary A, there exists a simulator S such that for every database \(D_{1} \in X^{n}\), every \(z \in \{0, 1 \}^{*}\), every integer \(i \in [n]\), and every \(W \subseteq \{0, 1 \}^{*}\), the following hold:
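The two conditions of Definition 1, referred to as Eqs. (1) and (2), can be written as follows. This is a reconstruction along the lines of [34], in which \((D_{1})_{-i}\), a notation assumed here, denotes the database \(D_{1}\) with the row of the ith individual removed:

\[Pr[Out_{A}(A(z) \Leftrightarrow San(D_{1})) \in W] \le e^{\epsilon } \cdot Pr[S(z, T((D_{1})_{-i}), i, n) \in W] \qquad (1)\]

\[Pr[S(z, T((D_{1})_{-i}), i, n) \in W] \le e^{\epsilon } \cdot Pr[Out_{A}(A(z) \Leftrightarrow San(D_{1})) \in W] \qquad (2)\]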
The probabilities are over the random coins of San and A, and of T and S, respectively. Equations (1) and (2) mean that whatever an adversary can compute by accessing the mechanism can essentially also be computed without accessing the mechanism, given only certain aggregate information. By properly choosing this aggregate information, researchers can tune the balance between utility and privacy. Also, Gehrke et al. [34] proved mathematically that Differential Privacy is a particular case of the proposed Zero-Knowledge Privacy mechanism. This work was among the initial research papers that tried to understand the dependencies of data, though not directly but through social networks and their analysis. Some limitations of the proposed work are its lack of application, as the approach was purely theoretical, and the absence of evidence of data utility preservation.
Kifer et al. [11] in 2011 gave initial arguments that considering the correlation between records is pivotal, as correlation between records or attributes can substantially decrease the privacy guarantee provided by any algorithm. This was the first attempt to formalize data correlation as a general phenomenon for real-time datasets. Later, this work formed a significant foundation for further research regarding correlated datasets. It identified the existence of data correlation and its adverse effects on data privacy but did not provide sufficient solutions to deal with it.
Various solutions proposed by other researchers
Proposed solutions using probabilistic models
Kifer et al. [40], in their subsequent work in 2014, proposed a privacy mechanism called Pufferfish, which helps in developing privacy definitions for different data sharing needs, studying existing privacy definitions, studying privacy compromise due to non-independent data records, and a number of other issues that are critical in terms of privacy. The proposed Pufferfish mechanism depends mainly on three foundation components: (a) the set of potential secrets: what is to be protected? (b) the set of discriminative pairs (Si, Sj): the attacker should not be able to distinguish between Si and Sj, which is a desirable privacy guarantee. (c) the evolution scenario: this includes knowledge of how the data has evolved and the attackers’ potential. The authors of [40] discuss these three components in detail with the help of examples. They also express the primary guarantees of the proposed Pufferfish mechanism in terms of odds and odds ratios. As per their definitions, if E1 and E2 are two mutually exclusive events, then the fraction \(\frac{P(E1)}{P(E2)}\) is their prior odds. If the ratio is equal to \(\alpha\) (say), then E1 is \(\alpha\) times as likely as E2. If A is the piece of information available, then \(\frac{P(E1 \mid A)}{P(E2 \mid A)}\) is the posterior odds. The ratio \(\frac{P(E1 \mid A)}{P(E2 \mid A)}\div \frac{P(E1)}{P(E2)}\) is the odds ratio, and if \(\frac{P(E1 \mid A)/P(E2 \mid A)}{P(E1)/P(E2)} \approx 1\), then one can say that A did not furnish any information that facilitated the differentiation between E1 and E2. Pufferfish provides this semantic guarantee. This work also uses the concept of Hedging Privacy, which provides good privacy levels and strengthens privacy definitions that are weaker than Differential Privacy. Specifically, it strengthens single prior privacy to protect data publishers whose prior beliefs rest on poor data models.
The set of potential secrets and the set of discriminative pairs remain the same. However, the evolution scenario changes by including conditional probabilities and other fixed-valued parameters containing all conditional probabilities. In the proposed work, this is used for histogram publishing with the privacy guarantees of Pufferfish. Also, Kifer et al. [40] mathematically proved, with the help of various lemmas and theorems, that the Pufferfish framework provides: (a) protection of continuous attributes, (b) protection of secrets that are aggregate properties of data, and (c) protection of distributional privacy. This work also discussed other points related to privacy and Differential Privacy. Other researchers have used the proposed Pufferfish mechanism as a part of their own work on privacy. However, the proposed framework did not satisfy Differential Privacy, and hence privacy preservation for correlated data still called for further investigation.
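The odds-based guarantee described above can be made concrete with a small calculation. The sketch below computes prior odds, posterior odds, and the odds ratio via Bayes’ rule for two mutually exclusive events. The function names and the probabilities are hypothetical, and the bound \(e^{-\epsilon } \le\) odds ratio \(\le e^{\epsilon }\) is used here as an assumed way of making “approximately 1” precise.

```python
import math

def prior_odds(p_e1, p_e2):
    """P(E1) / P(E2) before any information is observed."""
    return p_e1 / p_e2

def posterior_odds(p_e1, p_e2, p_a_given_e1, p_a_given_e2):
    """P(E1 | A) / P(E2 | A); the common factor P(A) cancels by Bayes' rule."""
    return (p_a_given_e1 * p_e1) / (p_a_given_e2 * p_e2)

def odds_ratio(p_e1, p_e2, p_a_given_e1, p_a_given_e2):
    """Posterior odds divided by prior odds."""
    return (posterior_odds(p_e1, p_e2, p_a_given_e1, p_a_given_e2)
            / prior_odds(p_e1, p_e2))

def within_bound(ratio, epsilon):
    """True if exp(-epsilon) <= ratio <= exp(epsilon)."""
    return math.exp(-epsilon) <= ratio <= math.exp(epsilon)

# Hypothetical numbers: E1 and E2 are two competing secrets, A is the output.
ratio = odds_ratio(p_e1=0.2, p_e2=0.3, p_a_given_e1=0.6, p_a_given_e2=0.5)
print(ratio)                        # about 1.2, independent of the prior
print(within_bound(ratio, 0.2))     # True, since exp(0.2) is about 1.22
print(within_bound(ratio, 0.1))     # False, since exp(0.1) is about 1.11
```

The odds ratio reduces to the likelihood ratio \(P(A \mid E1)/P(A \mid E2)\), so bounding it limits how much the released information A can shift an attacker’s belief between the two secrets, whatever the prior odds were.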
Yang et al. [41] in 2015 proposed another mechanism that used Bayesian principles to help Differential Privacy cope with correlated data. The proposed Bayesian Differential Privacy provides privacy in line with the formal commitments of Differential Privacy. It also provides privacy for correlated data and against an adversary with partial background knowledge. The authors achieved this by constructing a Gaussian correlation model to describe data with complex correlations and by proposing an algorithm that keeps the computation tractable.
Definition 5
Bayesian Differential Privacy. Let \(A = A(i,K)\) be an adversary \((K \subseteq [n] \setminus \{ i \})\) and \(M(x) = Pr(r \in S \mid x)\) be a randomized perturbation mechanism on database x.
The Bayesian differential privacy leakage of M w.r.t. A is
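The leakage expression itself is not shown above. A reconstruction along the lines of [41], assuming that \(x_{i}\) and \(x_{i}'\) denote two candidate values of the ith tuple and \(x_{K}\) the tuples known to the adversary, is:

\[BDPL_{A}(M) = \sup_{x_{i}, x_{i}', x_{K}, S} \log \frac{Pr(r \in S \mid x_{i}, x_{K})}{Pr(r \in S \mid x_{i}', x_{K})}\]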
We say M satisfies \(\epsilon\)-Bayesian differential privacy, or \(\epsilon\)-BDP, if \(\sup_{A} BDPL_{A}(M) \le \epsilon\). Further, the values of \(BDPL_{A}(M)\) for discrete and continuous values of database x can be calculated using Eqs. 6 and 7 of [41], respectively.
Wang et al. [6] in 2016 used the Pufferfish Mechanism (PM) along with the Wasserstein distance to ensure privacy for correlated data because, unlike Differential Privacy, Pufferfish is capable of hiding personal values despite correlation among multiple entries, and PM can provide better utility when a large number of entries are correlated [40]. For the demonstration, the authors used two examples: physical activity monitoring and flu status. Using conventional definitions of global sensitivity and the Pufferfish privacy mechanism, they proposed that the added noise must be proportional to the Wasserstein distance.
Definition 6
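p-Wasserstein Distance. Let (X, d) be a Radon space, and let u, v be two probability measures on X with finite pth moment. Then the pth Wasserstein distance between u and v is defined as:
\[W_{p}(u,v) = \left( \inf_{\gamma \in \Gamma (u,v)} \int_{X \times X} d(x,y)^{p} \, d\gamma (x,y) \right)^{1/p}\]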
where \(\Gamma (u,v)\) is the set of all couplings over u and v. Each \(\gamma \in \Gamma (u,v)\) can be regarded as a way to shift probability mass between u and v; the cost of a shift is \(\left( E_{(X,Y) \sim \gamma } [d(X,Y)^{p}] \right)^{1/p}\). The cost of the min-cost shift is the Wasserstein distance. Initially, the researchers applied the proposed version of Pufferfish using the Wasserstein distance to the examples mentioned above. They observed that the amount of noise added is less than the amount that would be added using global sensitivity; hence, utility was enhanced. This result was then generalized using mathematical proofs supported by a number of theorems. Nevertheless, the proposed mechanism could be computationally costly. Therefore, the authors proposed the Markov Quilt Mechanism for a more restricted setting in which a Bayesian network describes the dependence between entries. It relies on the observation that if nodes \(X_{i}\) and \(X_{j}\) are far apart in a graph g, then \(X_{j}\) is largely independent of \(X_{i}\), and the amount of noise needed to obscure \(X_{i}\) depends mainly on the local nodes around \(X_{i}\). This work further explained how to select which nodes are local and how to add a small correction for the effect of the distant nodes. Case studies of the earlier mentioned examples initially indicated the better utility of the proposed mechanism, which was later supported by mathematical proofs. To show that the proposed Markov Quilt Mechanism also offers considerable privacy, the researchers compared simulation results with [44]. Ultimately, the authors concluded that the proposed Markov Quilt Mechanism performs significantly better than the method proposed in [44] for a certain range of parameter values and slightly worse for the remaining range.
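The notion of a min-cost shift of probability mass can be illustrated in the simplest setting. The sketch below covers only the case p = 1 for two discrete distributions on the same equally spaced one-dimensional grid, where the distance equals the area between the two cumulative distribution functions. The function name and the example distributions are hypothetical, and the code does not reproduce the mechanism of [6].

```python
def wasserstein_1d(u, v, spacing=1.0):
    """1-Wasserstein distance between two distributions on the same grid.

    u and v list the probability mass at equally spaced grid points;
    the distance is the sum of |F_u - F_v| over the grid, times the spacing.
    """
    if len(u) != len(v):
        raise ValueError("u and v must be defined on the same grid")
    total = 0.0
    cumulative_gap = 0.0
    for mass_u, mass_v in zip(u, v):
        cumulative_gap += mass_u - mass_v      # F_u - F_v at this grid point
        total += abs(cumulative_gap) * spacing
    return total

u = [0.5, 0.5, 0.0]   # mass on grid points 0 and 1
v = [0.0, 0.5, 0.5]   # the same shape shifted one step to the right

print(wasserstein_1d(u, v))                   # 1.0: all mass moves one step
print(wasserstein_1d(u, u))                   # 0.0: identical distributions
print(wasserstein_1d([1, 0, 0], [0, 0, 1]))   # 2.0: a point mass moves two steps
```

The closer the distributions of a query answer under two competing secrets, the smaller this distance, which is the intuition behind adding noise in proportion to a Wasserstein distance rather than to the worst-case global sensitivity.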
Chen et al. [7], in 2017, first showed how correlated data could threaten privacy in Mobile Crowd Sensing (MCS). They proposed a perturbation mechanism based on a Bayesian network that adds noise while increasing the utility of the published data. They applied the mechanism to multiple available datasets, namely the ADULT dataset from the UCI Machine Learning Repository, the NLTCS dataset from StatLib, and the Search Logs dataset generated by interpolating Google Trends data and America Online search logs, and compared it with [8] and [9]. They concluded that the proposed mechanism provided much higher utility together with a decreased privacy budget. Another product of this work is the study of how the size of the maximum correlated data group influences the noise generated for perturbation; it showed that this group can be excluded, producing less noise, when it has a low impact on the statistical results. The authors assumed that the aggregate server has complete knowledge of the probabilistic relationship among records, which may not always be practically feasible. Nevertheless, one often need not model the Bayesian network over the entire dataset, since only the relationships between correlated records have to be modeled, and these are usually few.
In [49], Chen et al. first investigated the influence of sensing data correlation on differential privacy protection. Because the relationships among sensing data are complex and diverse, they explored different correlation models and, accordingly, proposed perturbation mechanisms from both the attacker's and the data owner's points of view. The proposed perturbation mechanism is based on the minimization of Laplace noise using the Bayesian network model. Subsequently, they conducted simulations to evaluate the proposed algorithms and demonstrated their effectiveness. They also proved the influence of the maximum correlated group on the Bayesian differential privacy mechanism by using the Gaussian correlation model.
Proposed solutions using behavioural and similarity analysis
Cao et al. [35], in 2012, used the example of a stock market to study Coupled Behaviour Analysis. The works [36,37,38] also shed light on the same particular type of correlated data, i.e., coupling between different users and their behaviors, using the same stock market example. When the activities of one or more actors are correlated with each other, such activities are termed Coupled Behaviour (CB), and their analysis is termed Coupled Behaviour Analysis (CBA). The work in [35] studies CB and CBA and how they can be a challenge; Cao et al. [35] were the first to conduct research on Coupled Behaviour Analysis. They described a behavior (IB) as a four-ingredient tuple \(IB = (\epsilon , O, C, R)\), where the actor \(\epsilon\) is the one who performs the behavior or upon whom the behavior is imposed, the operation O is what an actor does to accomplish the goal, the context C is the environment in which the behavior occurs, and the relationship R is a tuple that defines complex interactions between multiple actors [35]. In [35], they also defined a Behavior Feature Matrix FM(IB) as follows:
Here it is assumed that there are I actors (customers) \(\{ \epsilon _{1}, \epsilon _{2}, \ldots , \epsilon _{I} \}\); an actor \(\epsilon _{i}\) undertakes \(J_{i}\) behaviors \(\{ IB_{i1}, IB_{i2}, \ldots , IB_{iJ_{i}} \}\); the \(j^{th}\) behavior \(IB_{ij}\) of actor \(\epsilon _{i}\) is a K-variable vector, and its variable \([p_{ij}]_{k}\) reflects the \(k^{th}\) behavior property [35]. Others have earlier studied multivariate time-series-based analysis, which is close to CBA but technically different. This work also discussed multivariate time series, sequence analysis, and the Coupled Hidden Markov Model (CHMM), as they are related to the proposed work. A CHMM is a collection of multiple HMMs. The researchers used this approach to study abnormal behavior in a database such as a trading database. Adapting the CHMM to changes was time-consuming; hence the researchers subsequently developed an adaptive CHMM (ACHMM) by adding an automatically adaptive mechanism to improve its efficiency. The key stages of CHMM/ACHMM-based abnormal behavior analysis are described in Fig. 3.
The experiments showed that the CHMM outperforms a single HMM and that the ACHMM in turn outperforms the CHMM. The researchers tested the proposed algorithms on a real dataset from the Asian Stock Exchange covering 1 June 2004 to 31 December 2005. The metrics used to evaluate performance were

a.
\(Accuracy = \frac{(TP+TN)}{(TP+TN+FP+FN)}\)

b.
\(Precision = \frac{TP}{(TP+FP)}\)

c.
\(Recall = \frac{TP}{(TP+FN)}\)

d.
\(Specificity = \frac{TN}{(TN+FP)}\)
where TP stands for True Positive, i.e., an instance that belongs to the positive class and is classified as positive by the algorithm; FP stands for False Positive, i.e., an instance that belongs to the negative class but is misclassified as positive; TN stands for True Negative, i.e., an instance that belongs to the negative class and is classified as negative; and FN stands for False Negative, i.e., an instance that belongs to the positive class but is misclassified as negative. Tables 4 and 5 give the accuracy values and recall values for CHMM and ACHMM, respectively, obtained from the experiments performed. The tables record values for winsize = 20 minutes, and PNum stands for the number of detected abnormal activity sequences. The values of the other technical parameters for both methods can be found in [35], together with their detailed analysis.
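The four metrics listed above can be computed directly from the confusion-matrix counts. The sketch below is only illustrative: the function name and the example counts are hypothetical and are not taken from [35], and it assumes that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and specificity from confusion counts.

    Assumes every denominator is non-zero (hypothetical helper, not from [35]).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    for name, value in classification_metrics(tp=30, tn=50, fp=10, fn=10).items():
        print(f"{name}: {value:.3f}")
```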
This work gave an initial insight into CB and CBA, proposed CHMM and ACHMM, and showed their performance through experimental analysis of a real dataset. The major drawback was that, when applied to datasets of high dimensionality, the proposed mechanisms could not furnish the expected results. Hence, the proposed mechanisms need more exploration regarding their integration with more sophisticated domain knowledge. CBA offers considerable potential, challenges, and opportunities for future research.
As stated above, [35,36,37,38] deal with a special case of correlated data, i.e., behavior analysis of the stock market. Because the data of various investors and traders are often not mutually independent, behavioral coupling between investors and traders and among investors is frequently observed. Behavioral coupling between one investor and other investors is undesirable, as such investors together try to manipulate the market, earn abnormally high profits, and influence the trend. Hence financial market regulators are interested in detecting such couplings. The two frameworks proposed in [36] consist of three major stages, as described in Fig. 4.
In this work, the researchers proposed algorithms for all three stages using multiple variants of hierarchical grouping (HG) and hierarchical clustering (HC) techniques and compared them with the algorithms proposed in [35]. The same metrics were used to evaluate performance in both works, and the algorithms of [36] provided better results for anomaly detection. Tables 6 and 7 give the accuracy values and recall values for the three methods, respectively, obtained from the experiments performed. The tables record values for winsize = 20 minutes, and PNum stands for the number of detected abnormal activity sequences. The values of the other technical parameters for the methods can be found in [36], together with their detailed analysis.
When the two proposed methods were compared with each other, the CBAHG framework obtained better performance on all of the metrics used. Nevertheless, none of the mechanisms performed satisfactorily when applied to datasets of high dimensionality. Hence, more research on their varied applications is needed.
Longbing Cao [39], in 2013, presented another study on the non-IIDness of data, which usually takes the form of couplings and heterogeneity between objects, values, and attributes. This work first analyzed the IIDness assumptions of several classical algorithms and showed how they might not work properly on non-IID data. To measure non-IIDness, the author used a measure termed Coupled Item Similarity (CIS). The CIS between categorical items X and Y is defined as follows:
where \(X_{j}\) and \(Y_{j}\) are the values of item feature j for X and Y, respectively, and \(\delta ^{A}_{j}\) is the coupled attribute value similarity (CAVS). \(\delta ^{A}_{j}\) depends upon the intra-coupled attribute value similarity (IaAVS) \(\delta ^{Ia}_{j}\) and the inter-coupled attribute value similarity (IeAVS) \(\delta ^{Ie}_{j}\), as proposed in this work.
The values of \(\delta ^{Ia}_{j}\) and \(\delta ^{Ie}_{j}\) can be calculated using [1], Eqs. (4.2) and (4.7), respectively. Longbing Cao has also shown in detail how exploring non-IIDness allows us to comprehensively, systematically, and deeply explore couplings, correlation, and heterogeneity between values, attributes, and objects. This subsequently results in more robust algorithms, such as learning algorithms, pattern matching algorithms, and many more. The work thoroughly explains, with classic examples, the need to treat data as non-independent, but it does not relate this to data privacy, and it does not devise any mechanism to improve data privacy for non-independent data.
Proposed solutions using the sensitivity analysis approach
Chen et al. [8], in 2013, analyzed how correlated data can lead to unexpected privacy loss and proposed multiplying the global sensitivity by the number of correlated records to deal with the issue. They analyzed social network data, where the correlation among data records is high. The proposed method achieved the required privacy in the correlated dataset. However, in terms of the tradeoff between privacy and data utility, its major drawback was a high loss of data utility. Hence, the authors of [8] had yet to achieve an optimum balance between data privacy and data utility for correlated data. As used in this work,
Definition 2
Global sensitivity. For any function \(f: D \rightarrow R^{d}\), the global sensitivity of f is
\(GS(f) = \max _{D1, D2} \Vert f(D1) - f(D2) \Vert _{1}\)
for all D1, D2 s.t. \(\mid D1 \bigtriangleup D2 \mid = 1\), i.e., two databases D1 and D2 are neighbors if they differ on at most one record, denoted by \(\mid D1 \bigtriangleup D2 \mid = 1\).
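The utility cost of the approach in [8] can be illustrated with a small sketch. It assumes the standard Laplace mechanism, whose noise scale is the sensitivity divided by the privacy budget \(\epsilon\), and a counting query with global sensitivity 1; the function names and the parameter `num_correlated` are hypothetical and only stand for the number of correlated records by which the sensitivity is multiplied.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_scale(global_sensitivity, epsilon, num_correlated=1):
    # Naive correction for correlated data: multiply the global
    # sensitivity by the number of correlated records.
    return num_correlated * global_sensitivity / epsilon


def noisy_count(true_count, epsilon, num_correlated=1, rng=None):
    # Counting query: global sensitivity is 1 (one record changes the count by 1).
    rng = rng or random.Random()
    scale = laplace_scale(1.0, epsilon, num_correlated)
    return true_count + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    print("independent records:", noisy_count(1000, epsilon=0.5, rng=rng))
    print("10 correlated records:", noisy_count(1000, epsilon=0.5, num_correlated=10, rng=rng))
```

With ten correlated records the noise scale grows tenfold for the same \(\epsilon\), which is the utility loss discussed above.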
Then in the year 2014, the Pufferfish mechanism was proposed in [40] to give a solution to the problem of privacy loss due to correlated data, and it has already been discussed earlier in this paper.
In the following years, many further mechanisms were proposed to address the same threat. Zhu et al. [9], in 2015, described in detail the problems that arise from correlated datasets and how they compromise the privacy guarantee provided by conventional Differential Privacy, and they named the problem Correlated Differential Privacy. Traditional Differential Privacy never considered the relationship among records, which greatly and adversely influenced its privacy guarantees. The researchers also explained in detail the drawbacks of the naive solution proposed in [8], i.e., multiplying the global sensitivity by the number of correlated records to ensure the privacy level. They clearly showed that this solution induces an immense magnitude of noise, which significantly damages the utility of the data. Zhu et al. also categorized correlated record analysis methods and proposed a different sensitivity analysis method, called Correlated Sensitivity, which reduces the amount of noise compared to global sensitivity.
Definition 3
Correlated Sensitivity: For a query Q, the correlated sensitivity is determined by the maximal record sensitivity,
where q is a record set of all records responding to a query Q.
Definition 4
Record Sensitivity: For a given \(\bigtriangleup\) and a query Q, the record sensitivity of \(r_{i}\) is
where \(\delta _{ij} \in \bigtriangleup\). The record sensitivity measures the effect on all records in D when a record \(r_{i}\) is deleted. \(\bigtriangleup\) is the correlated degree matrix that holds the degree of correlation \(\delta _{ij}\) between every pair of records.
Along with this, the authors also proposed a mechanism that satisfies Differential Privacy for correlated datasets and preserves utility for further applications. Nevertheless, during the practical implementation of the proposed mechanism, it was observed that it depends on a number of parameters. One such parameter traded off against the accuracy of the proposed mechanism, and hence its application could not be recommended.
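The idea behind Definitions 3 and 4 can be sketched as follows. Because the exact formulas are not reproduced here, the sketch makes a loud assumption: the record sensitivity of \(r_{i}\) is taken to be a correlation-weighted sum of the query change caused by each record, and the correlated sensitivity is the maximum record sensitivity over the records answering the query, as stated in Definition 3. The names `record_sensitivity`, `correlated_sensitivity`, and `per_record_change` are hypothetical; the precise definitions are in [9].

```python
def record_sensitivity(i, delta, per_record_change):
    """Assumed form: sum over j of |delta[i][j]| * (query change caused by record j)."""
    return sum(abs(delta[i][j]) * per_record_change[j] for j in range(len(delta)))


def correlated_sensitivity(query_records, delta, per_record_change):
    """Maximal record sensitivity over the records that respond to the query."""
    return max(record_sensitivity(i, delta, per_record_change) for i in query_records)


if __name__ == "__main__":
    # Toy correlated degree matrix: records 0 and 1 are half-correlated,
    # record 2 is independent of both.
    delta = [
        [1.0, 0.5, 0.0],
        [0.5, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    per_record_change = [1.0, 1.0, 1.0]  # counting query: each record moves the count by 1
    cs = correlated_sensitivity([0, 1, 2], delta, per_record_change)
    naive = 2 * 1.0  # global sensitivity times the number of correlated records
    print("correlated sensitivity:", cs, "naive:", naive)
```

Under these assumptions, partially correlated records contribute only in proportion to their correlation degree, so the resulting sensitivity lies between the global sensitivity and the naive multiplied value.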
Liu et al. [10], in 2016, first showed that Differential Privacy cannot give a privacy guarantee for correlated data and then proposed a perturbation mechanism for dependent data, called Dependent Differential Privacy (DDP), to reduce such threats. To demonstrate the threats to correlated data, Liu et al. performed an inference attack on a real-world dataset. This showed how an adversary can infer sensitive information, such as user location, by using the noisy query outputs and exploiting users' social relationships, thus violating the Differential Privacy guarantees. The authors of [10] measured the performance of the inference attack using the following two metrics:
Dist(.) is defined in [10], Eq. (5), and H(.) denotes the entropy of a random variable [50]. H(Di) is the entropy of the prior probability and measures the adversary's prior information about Di without considering the social relationships. \(( H(Di)H(Di \parallel \mu D _{i})\) is the entropy of the posterior probability and measures the adversary's posterior information after the inference attack. By evaluating the Leaked Information, one can measure the privacy breach in terms of the change between the adversary's prior and posterior beliefs. Applied to real-world datasets, the above metrics clearly showed that conventional Differential Privacy applied to a dependent dataset discloses more sensitive information than expected, which is a serious privacy violation. The proposed perturbation mechanism, DDP, adds an extra parameter called the dependence coefficient to measure the dependence relationship between tuples more finely. The sensitivity under the dependence relationship between two tuples \(D_{i}\) and \(D_{j}\) is proposed as:
where \(\rho _{ij} \in [0,1]\) is a metric that measures the dependence between two tuples \(D_{i}\) and \(D_{j}\). The authors then proved, theoretically and experimentally, the utility and privacy superiority of the proposed DDP mechanism. Furthermore, they stated that the DDP mechanism is more accurate in computing k-means clustering centroids and SVM classifiers and in publishing the degree distance of large-scale social networks. They also proved that the proposed mechanism is resilient to inference attacks on dependent tuples, a case in which conventional Differential Privacy fails to provide privacy. The key challenge of the proposed approach is to compute the value of \(\rho _{ij}\) accurately, as it relies on probabilistic models of the statistical data; hence these models must be known beforehand. An overestimated value of \(\rho _{ij}\) still guarantees the privacy commitments of DDP, but an underestimated value degrades them.
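The Leaked Information metric discussed above can be illustrated with a rough sketch. It assumes, as one reading of the description, that the leakage is the drop in Shannon entropy from the adversary's prior distribution over a tuple's value to the posterior distribution after the attack; the exact definition is given in [10] and may differ. The function names and the example distributions are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution (zero-probability terms skipped)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def leaked_information(prior, posterior):
    """Assumed form: entropy of the prior minus entropy of the posterior."""
    return entropy(prior) - entropy(posterior)


if __name__ == "__main__":
    # Made-up example: four candidate locations for one user.
    prior = [0.25, 0.25, 0.25, 0.25]     # adversary knows nothing
    posterior = [0.70, 0.10, 0.10, 0.10]  # after exploiting social relationships
    print("prior entropy (bits):", entropy(prior))
    print("leaked information (bits):", leaked_information(prior, posterior))
```

A larger value means the attack sharpened the adversary's belief more; a value of zero means the noisy outputs revealed nothing beyond the prior.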
Proposed solution using information theory models
Wu et al. [12], in 2017, proposed Identity Differential Privacy, which uses the foundations of information theory to capture the weakness of conventional Differential Privacy when dealing with dependent data. This work also studied the connection between information theory and Differential Privacy. With the help of mathematical proofs, they showed how concepts of information theory can explain \(\epsilon\)-Differential Privacy and how, under the proposed Identity Differential Privacy, the mutual information between dependent records stays less than or equal to \(\epsilon\). As given in this work,
Lemma 1
If the mechanism M satisfies \(\epsilon\)-identity differential privacy, then for any source \(X_{i}\), any \(r \in R\), and any record \(t \in \chi _{i}\), we have
The mathematical proof of Lemma 1 can be found in detail in [12]. In information theory, the relative entropy is used to measure the distance between two probability distributions [51]. The relative entropy of \(X_{i}\) and \((X_{i} \mid Y = r)\), denoted as \(D(X_{i} \parallel (X_{i} \mid Y = r))\), satisfies the following result
Theorem 1
Let the mechanism M satisfy \(\epsilon\)-identity differential privacy and let Y be the output random variable of the mechanism. We have
Using Lemma 1, we have
Hence, the proof is complete.
Theorem 2
Let the mechanism M satisfy \(\epsilon\)-identity differential privacy and let Y be the output random variable of the mechanism. We have
Finally, the work concluded that concepts of information theory are well suited to modeling the problems of dependent data in Differential Privacy. However, it did not study whether using concepts of information theory induces any other privacy leakages. Without such a study, its practical application cannot be recommended.
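The role of relative entropy in this argument can be illustrated numerically. The sketch below assumes, as our reading of Lemma 1, that the mechanism bounds the log-ratio between the prior and posterior probability of every record value by \(\epsilon\); under that assumption the relative entropy between prior and posterior cannot exceed the same bound. The function names and the example distributions are hypothetical, and the precise statements are in [12].

```python
import math


def relative_entropy(p, q):
    """Relative entropy D(p || q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def max_log_ratio(p, q):
    """Largest |ln(p[i] / q[i])| over values with positive probability in both."""
    return max(abs(math.log(pi / qi)) for pi, qi in zip(p, q) if pi > 0 and qi > 0)


if __name__ == "__main__":
    prior = [0.50, 0.30, 0.20]      # distribution of X_i
    posterior = [0.45, 0.33, 0.22]  # distribution of X_i given Y = r (made up)
    eps = max_log_ratio(prior, posterior)
    d = relative_entropy(prior, posterior)
    # D(p || q) is an average of ln(p/q) under p, so it never exceeds the worst-case ratio.
    print(f"worst-case log-ratio: {eps:.4f}, relative entropy: {d:.4f}")
```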
Proposed solution using statistical correlation analysis method
Kumar et al. [42], in 2018, performed correlation analysis on selected datasets to improve accuracy in classification problems. The datasets used are a depressive disorder symptom dataset for evaluating depression severity, a local weather dataset for classifying depression severity, and a physiological sensor dataset for emotion detection. For the correlation analysis, the authors used Pearson's product-moment correlation method. Consider two variables, A and B. Pearson's correlation coefficient can then be calculated using the following formula:
\(C_{A,B} = \frac{Covariance(A,B)}{\sigma _{A} \sigma _{B}}\)
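where \(C_{A,B}\) is the correlation coefficient, Covariance(A,B) is the covariance, and \(\sigma _{A}\) and \(\sigma _{B}\) are the standard deviations of A and B, respectively. In the case of a dataset involving two sets, \(\{a_{1}, a_{2}, \ldots , a_{n}\}\) and \(\{b_{1}, b_{2}, \ldots , b_{n}\}\), the correlation coefficient can be calculated as:
\(C = \frac{\sum _{i=1}^{n} (a_{i} - a')(b_{i} - b')}{\sqrt{\sum _{i=1}^{n} (a_{i} - a')^{2}} \sqrt{\sum _{i=1}^{n} (b_{i} - b')^{2}}}\)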
where n is the sample size, \(a_{i}\) and \(b_{i}\) are the i-th data values, and \(a'\), \(b'\) are the mean values. The value of the coefficient C ranges between −1 and +1. Values close to +1 show a strong positive correlation, those close to −1 show a strong negative correlation, and those close to 0 show no linear relation. For feature selection, stepwise backward elimination was performed with the Weka tool using the Merit parameter. The classification algorithms used were Random Forest, Multinomial Logistic Regression, LogitBoost, and SVMs. Accuracy was then calculated with a different number of features in each iteration, using the accuracy formula as in [35], and an optimum number of features was selected according to the assigned ranks. A better correlation analysis technique is needed, since Pearson's product-moment correlation could not properly capture relationships among independent variables or other complex relationships. More study is also required to find an appropriate feature selection method.
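For reference, Pearson's coefficient for two equally sized samples can be computed as in the following minimal sketch. It uses only the standard library; the function name `pearson` and the example values are hypothetical and are not taken from [42].

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("samples must have the same length (at least 2)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly linear, increasing
    print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))    # perfectly linear, decreasing
    print(pearson([1, 2, 3, 4], [1, -1, -1, 1]))  # symmetric, no linear trend
```

The third example has a clear (quadratic-like) pattern yet a coefficient of zero, which illustrates the limitation noted above: the coefficient captures only linear relationships.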
Proposed solution using divide and conquer technique
Lv et al. [13], in 2019, aimed to achieve a Differential Privacy mechanism for correlated big data. Their first step was to find the correlation between data. For this, they used the MIC (Mutual information correlation) algorithm, which is based on information theory and mutual information theory. The main reason for selecting the MIC algorithm is that it can measure both linear and nonlinear correlation between variables [21]. The authors then proposed two Correlated Differential Privacy models for big data privacy protection: (1) k-Correlated Record Differential Privacy and (2) r-Correlated Block Differential Privacy.
Definition 7
Adjacent Data Set. Let \(D_{1}\) be a big data set with n records in total, including l correlated records \((l\ll n)\). Suppose that modifying one record changes \(k-1\) other records, yielding the data set \(D_{2}\); then \(D_{1}\) and \(D_{2}\) are adjacent data sets, represented as \(\mid D_{1} \bigtriangleup D_{2} \mid = k\) where \(1 \le k \le l\).
Definition 8
k-Correlated Record Differential Privacy (k-CRDP). Let \(D_{1}\) and \(D_{2}\) be two adjacent data sets, i.e., \(\mid D_{1} \bigtriangleup D_{2} \mid = k\), let A be the privacy mechanism, and let f be the query function. For any output \(S \in R\), k-CRDP is defined as follows
Definition 9
r-block division. Let D be a big data set. If there exists a set B = \(\{ D_{1}, D_{2}, \ldots , D_{r} \}\) such that \(D_{1}, D_{2}, \ldots , D_{r}\) are independent of each other and \(D_{1} \cup D_{2} \cup \cdots \cup D_{r} = D\), then B is called an r-block division of the big data set D.
To calculate sensitivity, the researchers used a machine learning approach to establish the dependency among correlated data records. The algorithms follow a divide-and-conquer approach. For the experimental analysis, they chose the National Air Quality Data, since time-series data are usually highly correlated. The proposed mechanism was compared to the mechanisms proposed in [8,9,10], with the parameters set to suitable values and MAE used as the performance evaluation function. One obstacle is that big data is often accompanied by enormous computing overhead; to address it, the authors proposed the r-Correlated Block Differential Privacy (r-CBDP) protection model. This approach largely solved the problem and improved efficiency through data partitioning and parallel computing. In addition, the proposed MIC k-means algorithm reduced the large distance-calculation time of the k-means clustering algorithm (Fig. 5).
The study of the influence of the block parameter r on privacy protection performance revealed that, as the value of r decreases, the amount of data correlation between sub-datasets increases and privacy protection performance decreases. Hence a large value of r results in high privacy protection performance. However, increasing the value of r also increases the time overhead, and thus the parameter r must be adjusted to balance time overhead and privacy protection performance. The researchers calculated the time complexity of the proposed mechanism to be \(O(n^{2})\), where n is the size of the data block.
The simulation results showed that: (a) The privacy protection performance of the proposed mechanism is better than that of the three mechanisms proposed in [8,9,10]. (b) According to the curves drawn from the experimental results, the influence of the privacy budget parameter \(\epsilon\) is not stable for the mechanisms of [8, 9], whereas its influence on the proposed method is relatively stable; the mechanism of [10] performs close to the proposed one. Overall, the proposed mechanism holds a performance advantage over the mechanisms of [8, 9] and [10].
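The kind of evaluation described above, i.e., mean absolute error (MAE) as a function of the privacy budget \(\epsilon\), can be sketched as follows. This is not the experimental setup of [13]: it assumes a plain Laplace mechanism with a fixed sensitivity and synthetic query answers, and all names (`mae_under_laplace`, `sensitivity`, `seed`) are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_absolute_error(true_values, released_values):
    return sum(abs(t - r) for t, r in zip(true_values, released_values)) / len(true_values)


def mae_under_laplace(true_values, epsilon, sensitivity=1.0, seed=0):
    """MAE of answers released with Laplace noise of scale sensitivity / epsilon."""
    rng = random.Random(seed)
    released = [t + laplace_noise(sensitivity / epsilon, rng) for t in true_values]
    return mean_absolute_error(true_values, released)


if __name__ == "__main__":
    answers = [float(i) for i in range(1000)]  # synthetic query answers
    for eps in (0.1, 0.5, 1.0, 2.0):
        print(f"epsilon={eps}: MAE={mae_under_laplace(answers, eps):.3f}")
```

A mechanism that needs a smaller sensitivity for correlated data obtains a lower MAE at the same \(\epsilon\), which is how the compared mechanisms are ranked.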
Some shortcomings of this work that may be addressed in future work are: (a) Only internal correlations have been considered; the external correlation of datasets remains to be examined. (b) The traditional MIC algorithm used here cannot handle high-dimensional data, and hence improved algorithms may be used in future work. (c) Selecting the value of k is complex in the k-means algorithm; other clustering algorithms may be explored and used for improved efficiency.
Proposed solution using exhaustive combinations
Zhao et al. [10], in 2019, proposed a slight modification to Differential Privacy to give a privacy guarantee for correlated data. To do so, they considered all viable combinations of tuple correlations and query responses. The authors concluded that the modification to Differential Privacy, termed Dependent Differential Privacy (DDP), provides a privacy guarantee for any data correlation model and any adversary's knowledge. Moreover, the authors showed quantitatively how DDP can be deduced from DP using a more robust privacy parameter.
Proposed solution using the concept of privacy leakage with time
Cao et al. [43], in mid-2019, addressed the issue of privacy loss in temporal data release. If data is recorded with respect to time, then such a correlated dataset is said to be a temporally correlated dataset, or simply a temporal dataset [43]. The main objectives of this work were to analyze and to quantify temporal privacy leakage. To analyze temporal correlations, they used the concept of the Markov chain, whose parameter is a transition matrix. Accordingly, they used the transition matrices \(P^{B}_{i}\) and \(P^{F}_{i}\), which are called the backward temporal correlation and the forward temporal correlation, respectively. The researchers also considered it reasonable to assume that the adversary may know these matrices.
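The relation between forward and backward temporal correlations in a Markov chain can be illustrated with a small sketch. It assumes that the forward matrix holds the probabilities of the current state given the previous one and that the backward matrix holds the probabilities of the previous state given the current one, so that the latter follows from the former and a prior by Bayes' rule. The function name and the numbers are hypothetical, and the exact conventions are those of [43].

```python
def backward_transition(forward, prior):
    """Derive backward transition probabilities by Bayes' rule.

    forward[i][j] = Pr(state_t = j | state_{t-1} = i)
    prior[i]      = Pr(state_{t-1} = i)
    returns backward[j][i] = Pr(state_{t-1} = i | state_t = j)

    Assumes every state has positive marginal probability at time t.
    """
    n = len(prior)
    # Marginal distribution of the state at time t.
    marginal = [sum(prior[i] * forward[i][j] for i in range(n)) for j in range(n)]
    return [
        [prior[i] * forward[i][j] / marginal[j] for i in range(n)]
        for j in range(n)
    ]


if __name__ == "__main__":
    # Made-up two-location example: users tend to stay where they are.
    forward = [[0.9, 0.1],
               [0.2, 0.8]]
    prior = [0.5, 0.5]
    for row in backward_transition(forward, prior):
        print([round(x, 3) for x in row])
```

The more the rows of such a matrix differ from one another, the more an observation at one time point reveals about neighboring time points, which is the source of the leakage quantified below.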
Backward Privacy Leakage (BPL): The privacy leakage of \(M_{t}\) caused by \(r^{1},..., r^{t}\) w.r.t. \(A^{T}_{i}\) is called backward privacy leakage, defined as follows:
Forward Privacy Leakage (FPL): The privacy leakage of \(M_{t}\) caused by \(r^{t},..., r^{T}\) w.r.t. \(A^{T}_{i}\) is called forward privacy leakage, defined as follows:
In this work, it was already shown how TPL can be calculated using the formula
Substituting the values of BPL and FPL as calculated in (16) and (18) into equation (20):
Since the worst-case privacy leakage must be considered among all users in the database, substituting (18) and (20) in (22) gives:
After quantifying TPL, \(\alpha\)-\(DP_{T}\), i.e., \(\alpha\)-DP under temporal correlations, was proposed. Using Eq. (23):
where \(\alpha _{t}^{B}\) denotes the Backward Privacy leakage, \(\alpha _{t}^{F}\) denotes the Forward Privacy leakage and \(\epsilon ^t\) is the privacy leakage due to conventional DP. Therefore,
The work then established the sequential composition theorem of \(\alpha\)-\(DP_{T}\). It showed that temporal correlations only affect event-level privacy (i.e., privacy at a particular time t) and do not affect user-level privacy (i.e., the privacy of each individual during the whole timeline). To calculate the values of BPL and FPL, and hence TPL, Cao et al. [43] formalized the objective function and constraints. The authors of [43] also showed that the optimum values can be found using the Simplex optimization algorithm, as the objective function is a form of linear fractional programming. The overall time complexity using the Simplex algorithm is \(O(n^{2}2^{n})\), as calculated in this work. The authors of [43] also discussed several different methods to calculate the values of BPL, FPL, and TPL, such as Polynomial Algorithm, Quasi-Quadratic Algorithm, Problem Formulation, and Sub-Linear Algorithm, and studied the relationship between \(\alpha\) and the run time of these algorithms. They also studied the relationship between data privacy and data utility by applying the proposed mechanism to different degrees of correlation and temporal data. The future scope of work lies in studying privacy leakage under temporal correlations combined with other types of correlation models.
In 2021, Hemkumar et al. [45] evaluated the correlation between timestamps and location statistics and its ability to threaten data privacy guarantees. They termed this type of correlation temporal correlation. They studied the difference between the current timestamp and the previous timestamp and used this value to introduce distortion. As the privacy algorithm, they used differential privacy. Conventional differential privacy assumes that the location statistics are independent, which differs from the nature of real-world datasets; thus, using differential privacy on such datasets causes more privacy leakage than expected. The solution proposed in [45] adopts w-event privacy for continuously releasing location timestamp data. The proposed methodology introduces distortion in the privacy budget depending on the correlation (similarity and dissimilarity) between the current and previous timestamps. The paper also evaluated the data utility for real and synthetic data. This work does not study the correlation between other types of attributes. It is limited to temporal correlation, whereas the correlation between other values is also an important factor for data privacy guarantees.
Proposed solution using weighted hierarchical graph technique
Li et al. [14], in 2019, first used small examples to demonstrate the effect of an adversary's prior knowledge, correlations between data, and sensitivity analysis. Using the same examples, they explained the cases of positive correlation, negative correlation, and no correlation, along with how an adversary with little or full prior knowledge can exploit them. The mechanisms proposed in [46,47,48] and [41], which are based on the Bayesian inference method, can only be applied in the case of positive correlation. The proposed Prior Differential Privacy (PDP) mechanism considers a set of distributions, unlike the Bayesian Differential Privacy (BDP) mechanism, which considers a single distribution. Under the proposed mechanism, the privacy leakage caused by the adversary with the help of data correlations must be less than or equal to \(\epsilon\), which is the maximal privacy leakage caused by adversaries with public distributions. To demonstrate the various effects, they used a weighted hierarchical graph (WHG) and calculated its node and edge values using the formulae specified in their work.
A few key points became evident when finding the edge and node values of the WHG: (a) The weakest adversary can cause privacy leakage when there is a positive correlation. (b) When the WHG has both positive and negative edges, the whole WHG has to be traversed to compute the correct privacy leakage. To traverse the full WHG, they first used a complete space searching algorithm with time complexity \(O(n^{4}2^{n-1})\), where n is the number of tuples in the database. To reduce this computational complexity, the researchers proposed a fast searching algorithm on a subspace of the original space, with little sacrifice of accuracy and a time complexity of \(O(n^{4})\). To analyze the impact of the factors mentioned above on continuous-valued data, Li et al. [19] described why the WHG method would not be appropriate and instead used the multivariate Gaussian model. The results are consistent with the results of the discrete-valued data analysis.
Finally, numerical simulation led to the following observations: (1) (a) There is no effect on privacy leakage when the adversary has the strongest prior knowledge. (b) With less strong prior knowledge, privacy leakage generally increases as correlation increases (for both positive and negative correlation). (2) Privacy leakage decreases as prior knowledge increases, for positive correlation in discrete-valued data and for positive and negative correlation in continuous-valued data. (3) The efficiency of the fast searching algorithm is much higher than that of the full space searching algorithm, and its accuracy is quite close to that of the full space searching algorithm; hence the fast searching algorithm is better suited to practical applications. Figures 6 and 7 illustrate observations 1(b) and 2 of this work, respectively. The proposed analysis was applied only to linear queries. However, it should be applied to nonlinear queries as well to evaluate its performance better.
Study of correlation constraints
By the end of 2015, Hyma et al. [52] had studied the various correlation constraints that need to be considered while ensuring privacy for a correlated dataset. They are:

a. Simple Correlation Constraints: The entire database is analyzed to identify correlations among data. The correlation can be either between records or between attributes.

b. Value-based Correlation Constraints: The privacy level is scrutinized based on the value of the data. The value decides the impact of the correlation on the privacy level and mechanism.

c. Attribute-based Correlation Constraints: Here, the associations between multiple attributes are examined. If such an association exists, knowing all attribute values of one record is likely to help the adversary correctly guess the unknown attribute values of another record for which only some attribute values are known. Hence, this has a noticeable impact on the privacy level.

d. Event-based Correlation Constraints: In database mining, events often refer to query occurrences. The correlation between successive events or queries has great potential to challenge the privacy level decided for a non-correlated system.

e. Personalized vs. Universal Correlation Constraints: Personalized privacy is the privacy level chosen by the record owner, whereas universal privacy is the privacy applied by the publisher after the owner hands the data over to the publisher. A proper balance between the two is required to achieve optimal data privacy.
These constraints need to be studied before correlation and its effect on data privacy can be understood. In this work, the researchers firmly concluded that the privacy mechanisms applied to correlated data and non-correlated data are very different. Moreover, a precise correlation constraint mechanism is required to achieve the required level of privacy. However, the work offers only a theoretical study of these points, illustrated with real-time examples, and lacks a practical implementation.
Experiment and analysis
Experimental setup
To analyze the correlation between the data, we used time-series data. We chose the daily “Air Quality Data” from various stations across several Indian cities. The data file contains nearly 30k records with a total of 13 columns, including city, date, air quality parameters, AQI, and an AQI bucket that classifies the air quality into an appropriate level. We considered the data from 2015 to 2020, excluding the missing data. After removing the missing records, we kept the city, date, AQI, AQI bucket, and the other air quality parameters except PM10, NH3, and Xylene. After preprocessing, the experimental dataset had nearly 17k records with a record length of 17. Since the experimental dataset contains city records, there is potential correlation among the records in the dataset. We used the MIC method to calculate the correlation existing between the records. We compiled the experiments and results and implemented them using the TensorFlow environment on Google Colaboratory, with 13 GB RAM and 108 GB disk. The experimental platform is a laptop with an Intel® Core\(^{\mathrm{TM}}\) i5-8250U CPU @ 1.6 GHz 1.80 GHz processor, 8 GB RAM, a 64-bit operating system, an x64-based processor, and Windows 10. Table 8 states the statistical features of the dataset.
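The idea behind MIC can be illustrated with a small, simplified sketch. MIC, as introduced by Reshef et al., searches over grids of at most n^0.6 cells and reports the largest grid mutual information normalized by the logarithm of the smaller grid dimension. The version below is only an approximation under stated assumptions: it uses fixed equal-width bins instead of the optimized grid placement of the original algorithm, so it can underestimate the true MIC. It is not the implementation used in the experiments, and all function names are hypothetical.

```python
import math
from collections import Counter


def _bin_indices(values, bins):
    """Assign each value to one of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def _mutual_information(xb, yb):
    """Mutual information (in bits) of two discretized sequences."""
    n = len(xb)
    joint = Counter(zip(xb, yb))
    px, py = Counter(xb), Counter(yb)
    mi = 0.0
    for (i, j), c in joint.items():
        # p_ij * log2(p_ij / (p_i * p_j)), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


def approx_mic(x, y, alpha=0.6):
    """Approximate MIC: best normalized grid MI over grids with nx * ny <= n**alpha."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("x and y must have equal length of at least 4")
    budget = max(int(n**alpha), 4)  # maximum number of grid cells
    best = 0.0
    for nx in range(2, budget // 2 + 1):
        for ny in range(2, budget // nx + 1):
            mi = _mutual_information(_bin_indices(x, nx), _bin_indices(y, ny))
            best = max(best, mi / math.log2(min(nx, ny)))
    return min(best, 1.0)  # guard against floating-point overshoot


if __name__ == "__main__":
    xs = list(range(100))
    print(approx_mic(xs, [2 * v + 1 for v in xs]))  # strongly dependent pair
    print(approx_mic(xs, [5.0] * 100))              # constant series
```

A score near 1 indicates a strong (not necessarily linear) dependence between the two series, and a score near 0 indicates little detectable dependence. In the experimental setting, the two inputs would be two numeric columns or two per-city series taken from the preprocessed dataset.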
Analysis and results
The quantitative analysis of the methods proposed in [8,9,10] and [13] shows that the methodology proposed in [13] outperforms the other methods with respect to the parameters listed in Table 9. For convenience, the methodology of [8] is referred to as Method A, [9] as Method B, [10] as Method C, and [13] as Method D.
The results of the various methods, as mentioned earlier, have been evaluated with the help of parameters such as \(\epsilon\) values, \(\epsilon\) value interval, maximum queries (for different step sizes), and Mean Average Error values.
The Mean Average Error (MAE) is fixed at 0.5 to evaluate the methods’ performance. The data in Table 9 make clear that, under identical conditions and \(\Delta\epsilon\) = 0.01, Method D can service 89 queries, whereas Method A, Method B, and Method C can service 54, 68, and 80 queries, respectively. Also, when the MAE values of the four methods were compared, Method D outperformed the other three, as stated in Table 10.
Method D, as proposed in [13], explicitly considered Big Data instead of standard data. Hence, this paper attracted much of our attention, and we studied and analyzed it in greater depth. Therefore, the quantitative analysis contains results concerning [13].
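As a small illustration of the error metric, the sketch below assumes that MAE here denotes the usual mean of absolute differences between true query answers and the perturbed answers released by a privacy mechanism. The function name and the sample values are hypothetical and are not taken from Tables 9 or 10.

```python
def mean_absolute_error(true_answers, noisy_answers):
    """Mean of |true - noisy| over all queries (assumed definition of MAE)."""
    if len(true_answers) != len(noisy_answers) or not true_answers:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - r) for t, r in zip(true_answers, noisy_answers))
    return total / len(true_answers)


if __name__ == "__main__":
    # Illustrative values only: three query answers and their perturbed releases.
    print(mean_absolute_error([10, 20, 30], [10.5, 19.5, 30.5]))
```

A lower value means the released answers stay closer to the true ones, that is, the mechanism preserves more data utility for the same privacy setting.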
Discussion
This paper presents a review of the works that identified data correlation as a privacy threat and tried to maintain data privacy guarantees by treating data correlation as an inherent property of real-world datasets. Some of the proposed algorithms also reduce the amount of noise to be added, so that data utility is maintained at the required level while data privacy is ensured. We have also presented a few graphs and tables to describe the results more clearly. Table 3 summarizes all the reviewed articles and their methodologies in tabular form for quick understanding.
A few studies have used the Maximal Information Coefficient (MIC) to understand the relationship between data better. MIC is a data correlation analysis technique and a machine learning concept, and it has given some promising results. The results show that the method using correlation analysis of data, i.e., MIC, performed better than the methods that did not consider data correlation. Exploring other data correlation analysis techniques to provide better data privacy can be a promising approach to the problem.
Correlated Big Data privacy
Various frameworks have been established in recent years to ensure big data privacy. Given the massive amount of data and the combination of structured and unstructured data, new Big Data models are needed to improve privacy and protection. Another observation is that most works have considered standard data when studying the above-mentioned issue, whereas real-world datasets are often large (technologically termed big data). As data size increases, several other parameters change, and the correlation among data also increases. A key feature of big data is its high dimensionality, which makes the application of various algorithms complex and yields unexpected results. High dimensionality also results in high data correlation. If it goes unnoticed, this can cause a significant privacy breach; hence, understanding and analyzing data correlation is a fundamental step towards ensuring data privacy guarantees for big data.
Machine learning for correlated Big Data privacy
Researchers have proposed various measures that recognize data correlation as a data privacy threat, along with various methodologies to deal with it. One of the promising methods, suggested by the authors of [13], is the use of MIC to calculate data correlation within the dataset. Tables 9 and 10 show the experimental results, which indicate that using MIC for correlated data privacy is superior to the other methods. This paved the way for considering other data correlation analysis techniques for the same purpose. It also led other researchers to believe that other machine learning tools can help address the data privacy threat mentioned above.
Future scope

Research on data correlation as a privacy threat for big data is comparatively scarce. Hence, there is a dire need to study the data correlation present in big datasets and explore it further to ensure their privacy.

More methods of calculating the correlation between data may be studied and applied to datasets, and their application to big datasets must be analyzed.

More efficient mechanisms for providing correlated data privacy protection may be explored and applied.

Other machine learning concepts must be explored and applied for ensuring data privacy for correlated datasets and correlated big datasets.
Conclusion

This work is a study of all the articles and research papers that identified data correlation as a data privacy threat and attempted to provide solutions. Table 3 provides a discussion about all the reviewed articles and their methodology in a tabular format for quick understanding.

MIC is a data correlation analysis technique and a machine learning concept. The researchers successfully used it to study and understand data correlation and consequently provide a solution for the above-mentioned data privacy threat. We consider this the most crucial finding because it introduced an effective way to calculate data correlation. It is also important because it paved the way for more research on using machine learning for data privacy.

Real-world datasets are often large and high dimensional. Nevertheless, most of the research has been done on ordinary datasets. Hence, researchers must explore data correlation as a threat to big data privacy more thoroughly.
Availability of data and materials
All relevant research data and materials are available with the authors.
References
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. L-diversity: Privacy beyond k-anonymity. ACM Trans Knowl Discov Data. 2007;1(1):3. https://doi.org/10.1145/1217299.1217302.
Li N, Li T, Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd International Conference on Data Engineering; 2007. p. 106–15. https://doi.org/10.1109/ICDE.2007.367856.
Dwork C. Differential privacy. In: 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006). Lecture Notes in Computer Science, vol. 4052, pp. 1–12. Springer
Yang X, Wang T, Ren X, Yu W. Survey on improving data utility in differentially private sequential data publishing. IEEE Trans Big Data. 2017. https://doi.org/10.1109/TBDATA.2017.2715334.
Jain P, Gyanchandani M, Khare N. Big data privacy: a technological perspective and review. J Big Data. 2016. https://doi.org/10.1186/s40537-016-0059-y.
Wang Y, Song S, Chaudhuri K. Privacy-preserving analysis of correlated data. arXiv:1603.03977; 2016.
Chen J, Ma H, Zhao D, Liu L. Correlated differential privacy protection for mobile crowdsensing. IEEE Trans Big Data. 2017. https://doi.org/10.1109/TBDATA.2017.2777862.
Chen R, Fung B, Yu P, Desai B. Correlated network data publication via differential privacy. VLDB J. 2014;23:653–76. https://doi.org/10.1007/s00778-013-0344-8.
Zhu T, Xiong P, Li G, Zhou W. Correlated differential privacy: Hiding information in non-iid data set. IEEE Trans Inf Foren Security. 2015;10(2):229–42. https://doi.org/10.1109/TIFS.2014.2368363.
Zhao J, Zhang J, Poor HV. Dependent differential privacy for correlated data, 2017;pp. 1–7. https://doi.org/10.1109/GLOCOMW.2017.8269219
Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. SIGMOD ’11, pp. 193–204. Association for Computing Machinery, New York, NY, USA, 2011. https://doi.org/10.1145/1989323.1989345.
Wu G, Xia X, He Y. Extending differential privacy for treating dependent records via information theory, 2017.
Lv D, Zhu S. Achieving correlated differential privacy of big data publication. Computers Security. 2019. https://doi.org/10.1016/j.cose.2018.12.017.
Li Y, Ren X, Yang S, Yang X. Impact of prior knowledge and data correlation on privacy leakage: A unified analysis. IEEE Trans Inf For Sec. 2019;14(9):2342–57. https://doi.org/10.1109/TIFS.2019.2895970.
Sunil K, Iliyoung C. Correlation analysis to identify the effective data in machine learning: Prediction of depressive disorder and emotion states. Int J Environ Res Public Health. 2018. https://doi.org/10.3390/ijerph15122907.
Reshef D, Reshef Y, Finucane H, Grossman S, McVean G, Turnbaugh P, Lander E, Mitzenmacher M, Sabeti P. Detecting novel associations in large data sets. Science (New York, NY). 2011;334:1518–24. https://doi.org/10.1126/science.1205438.
Pandey R, Dhoundiyal M, Kumar A. Correlation analysis of big data to support machine learning. Big Data. 2015. https://doi.org/10.1109/CSNT.2015.32.
Moraru A, Pesko M, Porcius M, Fortuna C, Mladenić D. Using machine learning on sensor data. CIT. 2010. https://doi.org/10.2498/cit.1001913.
Namuduri S, Narayanan BN, Davuluru VSP, Burton L, Bhansali S. Deep learning methods for sensor based predictive maintenance and future perspectives for electrochemical sensors. J Electrochem Soc. 2020;167(3):037552. https://doi.org/10.1149/1945-7111/ab67a8.
Moraru A, Pesko M, Porcius M, Fortuna C, Mladenic D. Using machine learning on sensor data. In: Proceedings of the ITI 2010, 32nd International Conference on Information Technology Interfaces, 2010;pp. 573–578.
Liang JY, Feng CJ, Song P. A survey on correlation analysis of big data, Big Data. 2016;39, 1–18. https://doi.org/10.11897/SP.J.1016.2016.00001
MC Kennel. A survey on correlation analysis of big data. Big Data. 2016; 39, 1–18. https://doi.org/10.11897/SP.J.1016.2016.00001
Priyank J, Manasi G, Nilay K. Big data privacy: a technological perspective and review. J Big Data. 2016. https://doi.org/10.1186/s40537-016-0059-y.
Priyank J, Manasi G, Nilay K. Enhanced secured map reduce layer for big data privacy and security. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0193-4.
Zhu XX, Tuia D, Mou L, Xia GS, Zhang L, Xu F, Fraundorfer F. Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geosci Rem Sens Magazine. 2017;5(4):8–36. https://doi.org/10.1109/MGRS.2017.2762307.
Maggiori E, Tarabalka Y, Charpiat G, Alliez P. Convolutional neural networks for large-scale remote sensing image classification. IEEE Trans Geosci Remote Sens. 2017;55:645–57. https://doi.org/10.1109/tgrs.2016.2612821.
Zhong L, Hu L, Zhou H. Deep learning based multi-temporal crop classification. Remote Sens Environ. 2019;221:430–43. https://doi.org/10.1016/j.rse.2018.11.032.
Ce Zhang XP. Isabel Sargent: Joint Deep Learning for land cover and land use classification. Rem Sens Environ. 2019;221:173–87. https://doi.org/10.1016/j.rse.2018.11.014.
Ma L, Liu Y, Zhang X, Ye Y, Yin G, Johnson BA. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J Photogrammetry Remote Sens. 2019;152:166–77. https://doi.org/10.1016/j.isprsjprs.2019.04.015.
Liu X, Han F, Ghazali KH, Mohamed II, Zhao Y. A review of convolutional neural networks in remote sensing image. In: Proceedings of the 2019 8th International Conference on Software and Computer Applications. ICSCA ’19, vol. 5, pp. 263–267. Association for Computing Machinery, New York, NY, USA, 2019. https://doi.org/10.1145/3316615.3316712.
Youssef R, Aniss M, Jamal C. Machine learning and deep learning in remote sensing and urban application: A systematic review and meta-analysis. In: Proceedings of the 4th Edition of International Conference on GeoIT and Water Resources 2020, GeoIT and Water Resources 2020. GEOIT4W2020, p. 5. Association for Computing Machinery, New York, NY, USA, 2020. https://doi.org/10.1145/3399205.3399224.
Kantarcioglu M, Ferrari E. Research challenges at the intersection of big data, security and privacy. Front Big Data. 2019;2:1. https://doi.org/10.3389/fdata.2019.00001.
Haina Ye MY, Xinzhou C. A survey of security and privacy in big data. Big Data. 2016. https://doi.org/10.1109/ISCIT.2016.7751634.
Gehrke J, Lui E, Pass R. Towards privacy for social networks: A zero-knowledge based definition of privacy. In: Ishai, Y. (ed.) Theory of Cryptography, 2011;pp. 432–449.
Cao L, Ou Y, Yu P. Coupled behavior analysis with applications. Knowledge Data Eng IEEE Trans. 2012;24:1–1. https://doi.org/10.1109/TKDE.2011.129.
Song Y, Cao L, Wu X, Wei G, Ye W, Ding W. Coupled behavior analysis for capturing coupling relationships in group-based market manipulations. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012. https://doi.org/10.1145/2339530.2339683.
Brand M, Oliver N, Pentland A. Coupled hidden markov models for complex action recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition 0, 994, 1997. https://doi.org/10.1109/CVPR.1997.609450
Ghosh A, Kleinberg R. Inferential privacy guarantees for differentially private mechanisms. CoRR, 2016. arXiv:1603.01508.
Cao L. Non-IIDness learning in behavioral and social data. Computer J. 2013;57:1358–70. https://doi.org/10.1093/comjnl/bxt084.
Kifer D, Machanavajjhala A. Pufferfish: A framework for mathematical privacy definitions. ACM Trans Database Syst. 2014. https://doi.org/10.1145/2514689.
Yang B, Sato I, Nakagawa H. Bayesian differential privacy on correlated data. 2015.
Kumar S, Chong I. Correlation analysis to identify the effective data in machine learning: Prediction of depressive disorder and emotion states. International Journal of Environmental Research and Public Health, 2018;15(12). https://doi.org/10.3390/ijerph15122907
Cao Y, Yoshikawa M, Xiao Y, Xiong L. Quantifying differential privacy under temporal correlations. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017;pp. 821–832. https://doi.org/10.1109/ICDE.2017.132
Li N, Qardaji W, Su D, Wu Y, Yang W. Membership privacy: A unifying framework for privacy definitions. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. CCS ’13, pp. 889–900. Association for Computing Machinery, New York, NY, USA, 2013. https://doi.org/10.1145/2508859.2516686.
Hemkumar D, Ravichandra S, Somayajulu DVLN. Impact of data correlation on privacy budget allocation in continuous publication of location statistics. Peer-to-Peer Network Appl. 2021;14(3):1650–65. https://doi.org/10.1007/s12083-021-01078-6.
Kifer D, Machanavajjhala A. A rigorous and customizable framework for privacy. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. PODS ’12, pp. 77–88. Association for Computing Machinery, New York, NY, USA, 2012. https://doi.org/10.1145/2213556.2213571.
Lee J, Clifton C. Differential identifiability. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’12, pp. 1041–1049. Association for Computing Machinery, New York, NY, USA, 2012. https://doi.org/10.1145/2339530.2339695.
Cover TM, Thomas JA. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). New York: Wiley-Interscience; 2006.
Chen J, Ma H, Zhao D, Liu L. Correlated differential privacy protection for mobile crowdsensing. IEEE Trans Big Data. 2021;7:4. https://doi.org/10.1109/TBDATA.2017.2777862.
Cover TM, Thomas JA. Elements of Information Theory, 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience; 2006.
Wang C, Cao L, Wang M, Li J, Wei W, Ou Y. Coupled nominal similarity in unsupervised learning. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM ’11, pp. 973–978. Association for Computing Machinery, New York, NY, USA, 2011. https://doi.org/10.1145/2063576.2063715.
Janapana H, Prasad PVGD, Damodaram A. A study of correlation impact on privacy preserving data mining. Int J Computer Appl. 2015;129:22–5. https://doi.org/10.5120/ijca2015907152.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
All authors have equally contributed to the building of this survey paper. All authors read and approved the final manuscript.
Authors' information
Sreemoyee Biswas is currently pursuing a Ph.D. in Computer Science and Engineering from Maulana Azad National Institute of Technology, Bhopal, India. Her field of research is “Big Data Privacy.” Other areas of specialization include Data Privacy, Information Security, and Machine Learning. She has about two years of experience as an Assistant Professor. Her Educational Qualification is M.Tech & B.E. in Computer Science and Engineering. Ms. Sreemoyee Biswas has publications in Scopus Journals & National Conference.
Dr. Nilay Khare is working as Professor in MANIT Bhopal. He has more than 21 years of experience. His Educational Qualification is a Ph.D. in Computer Science & Engineering. Dr. Nilay Khare’s areas of Specialization are Big Data, Big Data Privacy & Security, Wireless Networks, Theoretical computer science. He has publications in 54 International and National Conferences and International Journal. He is a Life Member of ISTE.
Dr. Pragati Agrawal is working as Assistant Professor in MANIT Bhopal. She has more than five years of experience. Her Educational Qualification is a Ph.D. in Computer Science & Engineering. Dr. Pragati Agrawal’s areas of Specialization are Theoretical Computer Science, Energy Efficiency. Dr. Pragati Agrawal’s publications are in International and National Conferences and International Journal. She is a Life Member of IEEE and ACM.
Dr. Priyank Jain is working as an Assistant Professor in IIIT Bhopal. He has more than ten years of experience as an Assistant Professor and Research Scholar. His Ph.D. is in the “Big Data” field. He has experience from the Indian Institute of Management, Ahmedabad, India (IIMA) in the research field. His Educational Qualification is M.Tech & BE in Information Technology. Mr. Priyank Jain’s areas of specialization are Big Data, Big Data Privacy & Security, data mining, PrivacyPreserving, & Information Retrieval. Mr. Priyank Jain has publications in various International Conference, International SCI, SCIE, and Scopus Journals & National Conference. He is a member of HIMSS.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All authors have given consent for publication of the matter.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Biswas, S., Khare, N., Agrawal, P. et al. Machine learning concepts for correlated Big Data privacy. J Big Data 8, 157 (2021). https://doi.org/10.1186/s40537-021-00530-x
DOI: https://doi.org/10.1186/s40537-021-00530-x
Keywords
 Big Data privacy
 Correlated datasets
 Data correlation
 Machine learning
 Correlated Big Data
 Data privacy threats
 Data privacy algorithms