In this paper, a case study on predicting the Alzheimer's disease risk of patients is presented. Unfortunately, none of the studies previously presented in the literature provides sufficient data to perform our study. Therefore, instead of collecting data from the literature, a simulated dataset is generated. Other studies in the literature, such as Tresch et al. [44], Giglio et al. [45], and Murray et al. [46], were also based on simulated datasets. To define the predictors of the current study, we relied on previous studies highlighting the importance of physical exercise, feeding, quality of life, and the existence of a parent with Alzheimer's disease. Based on those factors, the aim of the study is to give a percentage of Alzheimer's disease risk for each individual in a population. Since multiple predictors are involved and the outcome we aim to obtain is quantitative, multiple regression is the statistical model best suited to the study [16]. The theoretical basis of regression is explained thereafter. The steps undertaken in this study are presented in Fig. 1.
Regression analysis is a statistical model that indicates how variables are related on the basis of an equation. Formally, the variable we are trying to predict is called the dependent variable; the variable or variables used to predict its value are called independent variables (predictors). Simple regression is a regression with a single independent variable, while multiple regression is a regression with multiple independent variables. The procedures for accomplishing simple and multiple regression are broadly similar.
Simple regression
Consider the case where the Alzheimer's disease risk is predicted on the basis of one predictor, for instance the age of a patient. The population undertaken in this study is a population of patients recorded in a dataset. The aim is to predict the percentage of Alzheimer's disease risk, denoted y, on the basis of the patient's age, denoted x.
Eq. (1), which describes the relation binding x and y with an error term denoted ε, corresponds to a regression model. The model used in simple regression is written as follows:
$$y = \beta_{0} + \beta_{1} x + \varepsilon$$
(1)
\(\beta_{0}\) and \(\beta_{1}\) are the parameters of the population and ε is a random variable called the error term. The error term accounts for the variability that is not explained by the linear relation between x and y.
The patient population can be seen as a set of subpopulations, each related to a given value of x. Thus, one of the subpopulations consists of all patients who have already reached their sixties. Each subpopulation has a particular distribution of y; for instance, a distribution of y is associated with the patients in their sixties. Each distribution of y values has its own mean, or mathematical expectation. The equation that describes how the mean, or mathematical expectation, of y, denoted E(y), is related to x is called the regression equation. The regression equation is written as follows:
$$E(y) = \beta_{0} + \beta_{1} x$$
(2)
\(\beta_{0}\) and \(\beta_{1}\) are unknown parameters. Subsequently, we will use the statistical procedure known as least squares estimation to estimate their values. The sample statistics b0 and b1 are used to estimate \(\beta_{0}\) and \(\beta_{1}\), respectively.
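Using the closed-form least squares estimates b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄, the computation can be sketched as follows; the ages and risk percentages are illustrative values, not data from the study:

```python
def simple_regression(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # b1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical patient ages and risk percentages, for illustration only.
ages = [55, 60, 65, 70, 75]
risk = [10.0, 14.0, 19.0, 22.0, 27.0]
b0, b1 = simple_regression(ages, risk)
```

The fitted line then yields a predicted risk percentage for any age via b0 + b1·x.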
Multiple regression
Consider the case where the Alzheimer's disease risk is predicted on the basis of several predictors, for instance the age of a patient, the geographical area, the number of work hours, the hours of physical exercise, the existence of a parent with Alzheimer's disease, the feeding, and the existence of a Lyme disease risk. The population undertaken in this study is a population of patients recorded in a dataset. The aim is to predict the percentage of Alzheimer's disease risk, denoted y, on the basis of the predictors pinpointed above.
Eq. (3), which describes the relation binding the xi and y with an error term denoted ε, corresponds to a regression model. The model used in multiple regression is written as follows:
$$y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{k} x_{k} + \varepsilon$$
(3)
The equation that describes how the mean, or mathematical expectation, of y, denoted E(y), is related to the xi is called the regression equation. The regression equation is written as follows:
$$E(y) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{k} x_{k}$$
(4)
\(\beta_{i \in [0,k]}\) are unknown parameters. Subsequently, we will use the statistical procedure known as least squares estimation to estimate their values. The statistics \(b_{i \in [0,k]}\) are sample statistics used to estimate \(\beta_{i \in [0,k]}\).
The Alzheimer’s disease prediction statistical model
As we pointed out earlier in this paper, we believe that seven predictors have a great impact on predicting the Alzheimer's disease risk. The first predictor, denoted x1, is the age of an individual. The second, denoted x2, is the geographical area. The third, denoted x3, is the number of work hours per day. The fourth, denoted x4, is the hours of physical exercise. The fifth, denoted x5, is the existence of a parent with Alzheimer's disease. The sixth, denoted x6, is the quality of feeding. The seventh and last predictor, denoted x7, is the existence of Lyme disease. In conducting a statistical study, we would like to answer the following questions: Do these variables really impact the Alzheimer's disease risk? Is there a relationship between the variables, and if so, what is it? Can the values of these parameters be adjusted in order to efficiently predict the Alzheimer's disease risk?
Let us assume that x1i is the random variable associating an age with an individual i; x2i is the random variable associating a number indicating an area with an individual i; x3i is the random variable associating a number indicating the work hours per day with an individual i; x4i is the random variable associating a number indicating the hours of physical exercise with an individual i; x5i is the random variable associating a number indicating the existence or absence of a parent with Alzheimer's disease with an individual i; x6i is the random variable associating a number indicating the quality of feeding with an individual i; and x7i is the random variable associating a number indicating the existence of Lyme disease with an individual i. The regression model that describes the study is as follows:
$$y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5} + \beta_{6} x_{6} + \beta_{7} x_{7} + \varepsilon$$
(5)
Throughout this paper, the steps undertaken to estimate the unknown parameters \(\beta_{i \in [0,7]}\) are explained in detail. The next section explains the sampling stage.
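Once the parameters are estimated, Eq. (5) produces a risk percentage per individual. The sketch below illustrates the structure of the model only; the attribute names, encodings, and coefficient values are hypothetical placeholders, not estimates from the study:

```python
# Predictors x1..x7 of Eq. (5); names and encodings are assumptions.
PREDICTORS = ["age", "area", "work_hours", "exercise_hours",
              "parent_with_ad", "feeding_quality", "lyme_disease"]

def predict_risk(beta, patient):
    """Return y = b0 + b1*x1 + ... + b7*x7 for one patient record."""
    assert len(beta) == len(PREDICTORS) + 1
    y = beta[0]
    for b, name in zip(beta[1:], PREDICTORS):
        y += b * patient[name]
    return y

# Hypothetical coefficients b0..b7 and one hypothetical patient record.
beta = [1.0, 0.2, 0.0, 0.1, -0.3, 5.0, -1.0, 2.0]
patient = {"age": 70, "area": 2, "work_hours": 6, "exercise_hours": 1,
           "parent_with_ad": 1, "feeding_quality": 3, "lyme_disease": 0}
risk = predict_risk(beta, patient)
```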
The Alzheimer’s disease prediction sampling stage
The sampling stage is a fundamental stage that has a great impact on the accuracy of the prediction model. Indeed, a small sample, or a sample of similar individuals, can lead to an inaccurate model [42]. Thus, sampling efficiently means predicting efficiently. To tackle the problem of small samples, a number of statistical methods renowned for predicting efficiently from small samples exist, such as Hurvich and Tsai [47]. In addition, the central limit theorem can be applied when the population is large. This theorem states that the sampling distribution of the sample mean can be approximated by a normal probability distribution in the case of a large sample. In practice, the sampling distribution can be approximated by a normal distribution when the sample size is greater than or equal to 30 [42].
To carry out the sampling stage, the proposed solution randomly picks an individual and compares it with the previously picked ones. If it is not similar, and does not have properties close, to any previously picked individual, it is added to the sample. The pseudo-code below details the steps taken to accomplish the sampling stage.
The function notsimilar takes a patient as input and returns a Boolean. The function compares each attribute of the input with the attributes of the individuals already in the sample; if there is any similarity, the function returns false. Otherwise, it returns true.
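The sampling procedure described above can be sketched as follows; encoding each patient as a tuple of numeric attributes and the similarity tolerance are assumptions for illustration:

```python
import random

def not_similar(candidate, sample, tol=0.0):
    """Return True if the candidate differs from every individual already
    in the sample on at least one attribute (within tolerance tol)."""
    for individual in sample:
        if all(abs(a - b) <= tol for a, b in zip(candidate, individual)):
            return False
    return True

def draw_sample(population, size, seed=0):
    """Randomly pick individuals, keeping only those not similar to any
    previously picked one, until the requested size is reached."""
    rng = random.Random(seed)
    sample = []
    pool = list(population)
    while pool and len(sample) < size:
        candidate = pool.pop(rng.randrange(len(pool)))
        if not_similar(candidate, sample):
            sample.append(candidate)
    return sample

# Illustrative population with one duplicated individual.
population = [(1, 2), (1, 2), (3, 4), (5, 6)]
sample = draw_sample(population, 3)
```

The duplicate is rejected by not_similar, so the resulting sample contains only distinct individuals.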
The Adjoint method for the least squares estimation problem
To estimate the unknown parameters \(\beta_{i \in [0,k]}\), least squares estimation is the most commonly used method [20]. QR factorization solves the ordinary least squares problem [20], and reference [20] relates the least squares estimation method step by step. Briefly, estimating the unknown parameters \(\beta_{i \in [0,k]}\) is equivalent to solving a system of k + 1 equations with k + 1 unknowns. In our case, and in contrast with the literature, the Adjoint method is used to solve this system. As a matter of fact, the system of equations can be expressed in compact form by using matrix notation. The notation is as follows:
$$A = \left[ {\begin{array}{*{20}c} n & {\sum\nolimits_{i = 1}^{n} {x_{1i} } } & \ldots & {\sum\nolimits_{i = 1}^{n} {x_{ki} } } \\ {\sum\nolimits_{i = 1}^{n} {x_{1i} } } & {\sum\nolimits_{i = 1}^{n} {x_{1i}^{2} } } & \ldots & {\sum\nolimits_{i = 1}^{n} {x_{1i} x_{ki} } } \\ \vdots & \vdots & \ddots & \vdots \\ {\sum\nolimits_{i = 1}^{n} {x_{ki} } } & {\sum\nolimits_{i = 1}^{n} {x_{ki} x_{1i} } } & \ldots & {\sum\nolimits_{i = 1}^{n} {x_{ki}^{2} } } \\ \end{array} } \right],B = \left[ {\begin{array}{*{20}c} {b_{0} } \\ {b_{1} } \\ \vdots \\ {b_{k} } \\ \end{array} } \right],Y = \left[ {\begin{array}{*{20}c} {\sum\nolimits_{i = 1}^{n} {y_{i} } } \\ {\sum\nolimits_{i = 1}^{n} {x_{1i} y_{i} } } \\ \vdots \\ {\sum\nolimits_{i = 1}^{n} {x_{ki} y_{i} } } \\ \end{array} } \right]$$
where n denotes the sample size, and A · B = Y.
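The Adjoint method computes B = adj(A) · Y / det(A); equivalently, each bᵢ = det(Aᵢ)/det(A), where Aᵢ is A with its i-th column replaced by Y (Cramer's rule). A minimal pure-Python sketch of this sequential version follows; the paper's own implementation is not reproduced, and the 2×2 system at the end is purely illustrative:

```python
def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

def adjoint_solve(A, Y):
    """Solve A.B = Y as B = adj(A).Y / det(A), i.e. b_i = det(A_i)/det(A)."""
    d = det(A)
    B = []
    for i in range(len(A)):
        # A_i: A with its i-th column replaced by Y.
        Ai = [row[:i] + [Y[r]] + row[i + 1:] for r, row in enumerate(A)]
        B.append(det(Ai) / d)
    return B

# Illustrative 2x2 system: 2*b0 + b1 = 5, b0 + 3*b1 = 10.
B = adjoint_solve([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0])  # b0 = 1, b1 = 3
```

The cofactor expansion costs O(n!) determinant terms, which is consistent with the scaling limits reported below for large samples and many predictors.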
This part of the code was tested against large-scale data to discover its limits. Unfortunately, the method suffers from shortcomings when the patient sample is large and when the number of predictors is very high. To overcome those shortcomings, a new computational approach is presented thereafter. This approach is massively parallel, in order to absorb the massive calculations and increase the method's performance.
MR-AM: MapReduce with Adjoint method
MapReduce is a programming model for data processing [48]. It enables distributed algorithms to run in parallel on clusters of machines with varied features. MapReduce also handles the parallel computation issues, so users can concentrate their efforts on the programming model. Since its advent, MapReduce has gained popularity in both the scientific community and industry due to its effectiveness in parallel processing [49]. Indeed, the parallelization of the QR factorization and SVD matrix decomposition methods is a relevant example of the scientific community's interest in MapReduce. Benson et al. [50] reported matrix decomposition methods implemented with MapReduce programming. As pointed out earlier, QR factorization is the most common method used to solve the least squares estimation problem. To the best of our knowledge, the Adjoint method has not yet been implemented on a MapReduce framework. Thus, in this paper, an implementation of the Adjoint method on MapReduce is detailed with the aim of solving the least squares estimation problem.
Working within MapReduce requires redesigning traditional algorithms. As a matter of fact, the computation is expressed as two phases: map and reduce. Each phase has key-value pairs as input and output. Two functions must be specified: the map function and the reduce function. The types of the key-value pairs may be chosen by the programmer.
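As a minimal illustration of the programming model (not of MR-AM itself), the two phases can be simulated sequentially in plain Python; the names run_mapreduce, map_fn, and reduce_fn are chosen here, and the word-count task is a standard toy example:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential simulation of MapReduce: map each record to key-value
    pairs, shuffle (sort and group by key), then reduce each group."""
    pairs = [kv for record in records for kv in map_fn(record)]
    pairs.sort(key=itemgetter(0))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# Toy example: count occurrences of each word across the input lines.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
```

In a real framework the map calls run in parallel across machines and the shuffle is distributed; the key-value contract, however, is exactly the one shown.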
This paper proposes a MapReduce-based Adjoint method (MR-AM) to make the conventional Adjoint method work effectively in a distributed environment. Our method has two steps, which the following part describes in detail.
MapReduce breaks the processing into two phases: the map phase and the reduce phase. Each phase has (key, value) pairs as input and output. In the current study, a text input format represents each line in the dataset as a text value. The key is the first number, separated by a plus sign from the remainder of the line. Consider the following sample lines of input data:
$$\begin{aligned} &0 + 067 - 011 - 95 \ldots \\ &0 + 143 - 101 - 22 \ldots \\ &1 + 243 - 011 - 22 \ldots \\ &\qquad \vdots \\ &4 + 340 - 310 - 12 \ldots \\ &4 + 44 - 301 - 265 \ldots \\ \end{aligned}$$
The keys are the line numbers of the A matrix. The map function calculates the determinants used to obtain the B matrix. The output of the map function is as follows:
$$\begin{aligned} &(0, \, 10) \\ &(0, \, 22) \\ &(1, \, 11) \\ &\quad \vdots \\ &(4, \, 78) \\ &(4, \, 80) \\ \end{aligned}$$
The pseudo code of Map Function is as follows:
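The paper's pseudo-code figure for the map function is not reproduced here. The sketch below is one plausible realization consistent with the input format described above (the coefficient index before the '+' sign) and the (key, value) pairs shown; the payload encoding and the compute_partial_determinant stand-in are assumptions, not the paper's actual computation:

```python
def compute_partial_determinant(values):
    """Stand-in for the per-line determinant computation performed by
    MR-AM (given in the paper's figure); here the product of the parsed
    values, so that the data flow can be exercised end to end."""
    result = 1.0
    for v in values:
        result *= v
    return result

def map_fn(line):
    """Emit (i, d): the index i precedes the '+' sign; d is a partial
    determinant computed from the remainder of the line."""
    key_part, payload = line.split("+", 1)
    values = [float(v) for v in payload.split("-") if v]
    return (int(key_part), compute_partial_determinant(values))

mapped = map_fn("0+067-011-95")
```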
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
$$\begin{aligned} &(0, \, [10, \, 22]) \\ &(1, \, [11, \, 111]) \\ &\quad \vdots \\ &(4, \, [78, \, 80]) \\ \end{aligned}$$
The reduce function returns (i, βi) as output. The output of the reduce function is as follows:
$$\begin{aligned} &(0, \, 2) \\ &(1, \, 5) \\ &(2, \, 6.5) \\ &(3, \, 7) \\ &(4, \, 8) \\ \end{aligned}$$
The pseudo code of Reduce Function is as follows:
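As with the map function, the paper's pseudo-code figure for the reduce function is not reproduced here. A hedged sketch follows, assuming the grouped values for index i are det(Aᵢ) and det(A) in that order, so that βᵢ = det(Aᵢ)/det(A) per Cramer's rule; the actual combination rule is the one given in the paper's figure:

```python
def reduce_fn(i, partials):
    """Combine the grouped partial determinants for coefficient index i.
    Assumption: partials = [det(A_i), det(A)], so beta_i is their ratio."""
    det_ai, det_a = partials
    return (i, det_ai / det_a)

combined = reduce_fn(1, [15.0, 3.0])
```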