A MapReduce-based Adjoint method for preventing brain disease
 Manal Zettam^{1},
 Jalal Laassiri^{1} and
 Nourddine Enneya^{1}
 Received: 4 May 2018
 Accepted: 25 July 2018
 Published: 2 August 2018
Abstract
In this paper, we present a statistical model built from a patient dataset. The model efficiently predicts brain disease risk. Multiple regression is used to build the statistical model. The least squares estimation problem, which is usually solved to estimate the parameters of a regression model, is solved via a parallelized algebraic Adjoint method. As the parallelized algebraic Adjoint method is not the only MapReduce-based method for solving the least squares problem, experiments were carried out to position the Adjoint method among the other methods. The calculated job completion time shows that the MapReduce-based Adjoint method is competitive.
Keywords
 Brain disease
 Adjoint method
 Multiple regression
 MapReduce
Introduction
Further studies highlight the relevance of physical exercise and diet in preventing Alzheimer’s disease [8–11]. Moreover, the relationship between burden and Alzheimer’s disease is examined in Bu et al. [12], the link between bacterial infection and Alzheimer’s disease is identified in Maheshwari and Eslick [13], and the link between Lyme disease and Alzheimer’s disease is reported in MacDonald [14].
To the best of our knowledge, existing studies of Alzheimer’s disease prediction do not rely on a software solution based on factors such as age, daily working hours, and the existence of a parent with Alzheimer’s disease. Therefore, we propose a solution that receives a dataset of patients with a variable number of attributes and then constructs a statistical model to identify potential Alzheimer’s disease patients. We also parallelize the Adjoint method via MapReduce. The parallelized algebraic Adjoint method was first presented briefly in our previous work, Zettam et al. [15].
The proposed solution estimates the Alzheimer’s disease risk based on a statistical model. Statistical models for prediction can be divided into three main classes: regression, classification, and neural networks [16].
Regression analysis is one of the most widely used empirical tools. It is used to predict the unknown value of a variable from the known values of one or more other variables, called the predictors [17]. Simple, multiple and logistic regression are the most used forms of regression in the literature [18]. The appropriate choice of regression model depends on the number of predictors and the type of the outcome variable. Hosmer and Lemeshow [19] present a detailed overview of logistic regression and its applications. For their part, references [20, 21] give detailed overviews of simple and multiple regression with examples of their application to real-life problems. In the medical field, several studies have used regression models, for example to predict long-term mortality in oesophageal [22] and relative survival in cancer registries [23].
Classification has two distinct meanings. The first is known as unsupervised learning (or clustering), the second as supervised learning [17]. In the statistical literature, supervised learning is usually referred to as discrimination, which means establishing the classification rule from given, correctly classified data [17]. Chatap and Shrivastava [24] presented a detailed survey of classification methods used in the medical field, such as the CART method [25], the CSO decision tree algorithm [26], Chi-squared automated interaction detection [27], Quick, Unbiased, Efficient, Statistical Tree (QUEST) [28], and Discriminant Analysis [29]. Further information can be found in Michie et al. [17].
The term neural network encompasses a large class of models and learning methods. A neural network is a nonlinear statistical model. Neural networks were developed decades ago by scientists attempting to model the learning process of the human brain [30]. The best-known neural network method is the single-hidden-layer backpropagation network. The discovery of backpropagation in the late 1980s by Rumelhart et al. [31] was an impetus to the adoption of neural networks in several fields, including the medical field. In this field, neural network methods have proven their efficiency as a diagnostic tool. Indeed, since the study performed by Szolovits et al. [32], many studies have been published on conditions such as colorectal cancer [33], multiple sclerosis lesions [34], colon cancer [35], pancreatic disease [36], gynecological diseases [37], and early diabetes [38]. Readers may refer to Amato et al. [39] for more details.
Other statistical models that do not fit into the three main classes are also used in the prediction literature, such as those presented in Cesa-Bianchi and Lugosi [40] and Chen et al. [41]. Those models differ from the ones presented above.
As stated before, the choice of a suitable statistical model depends on the type of predictors and the nature of the outcome. Furthermore, the use of variance analysis instead of regression to provide a quantitative outcome is a common issue pointed out by a number of statisticians such as Anderson et al. [42] and Tribout [43]. These authors clearly report the main differences between regression and variance analysis. In addition, reference [43] claims that some software solutions that aim to facilitate their use combine regression and variance analysis under the acronym ANOVA.
In this study, a regression model is used to build the prediction model because of the nature of the predictors and the outcome variable. The rest of the paper is organized as follows. The second section addresses the case study. This section is divided into several subsections that present the variables used for modeling, detail the sampling stage, describe the application of multiple regression, give a brief overview of the Adjoint method used to solve the least squares estimation problem, and introduce the MRAM method. The third section then presents the technique used to evaluate the strength of the resulting model. Finally, the last section sums up the current work.
The Alzheimer’s disease prediction case study
Regression analysis is a statistical technique that indicates how variables are related on the basis of an equation. Formally, the variable we are trying to predict is called the dependent variable, and the variable or variables used to predict its value are called independent variables (predictors). Simple regression is regression with a single independent variable. Multiple regression is regression with several independent variables. The procedures for carrying out simple and multiple regression are broadly similar.
The simple regression
Consider the case where the Alzheimer’s disease risk is predicted on the basis of one predictor, for instance, the age of a patient. The population considered in this study is a population of patients recorded in a dataset. The aim is to predict the percentage of Alzheimer’s disease risk, denoted y, on the basis of the patient’s age, denoted x_{1}.
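The simple regression equation itself is not reproduced in this text. As a minimal sketch, assuming the usual linear form y = b0 + b1 x_{1} plus an error term, the two coefficients can be estimated in closed form by least squares. The function name and the age and risk values below are hypothetical and serve only as an illustration.

```python
def fit_simple_regression(xs, ys):
    """Return (b0, b1) minimizing sum((y - b0 - b1 * x) ** 2)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical ages (x_1) and risk percentages (y), for illustration only.
ages = [50, 60, 70, 80]
risk = [5.0, 10.0, 15.0, 20.0]
b0, b1 = fit_simple_regression(ages, risk)
```

With these made-up values the fitted line passes exactly through the four points, which makes the sketch easy to check by hand.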
The multiple regression
Consider the case where the Alzheimer’s disease risk is predicted on the basis of several predictors, for instance, the age of a patient, the geographical area, the number of working hours, the hours of physical exercise, the existence of a parent with Alzheimer’s disease, the diet, and the existence of Lyme disease risk. The population considered in this study is a population of patients recorded in a dataset. The aim is to predict the percentage of Alzheimer’s disease risk, denoted y, on the basis of the predictors listed above.
The Alzheimer’s disease prediction statistical model
As we pointed out earlier in this paper, we believe that seven predictors have a great impact on predicting the Alzheimer’s disease risk. The first predictor, denoted x_{1}, is the age of an individual. The second, denoted x_{2}, is the geographical area. The third, denoted x_{3}, is the number of working hours per day. The fourth, denoted x_{4}, is the hours of physical exercise. The fifth, denoted x_{5}, is the existence of a parent with Alzheimer’s disease. The sixth, denoted x_{6}, is the quality of diet. The seventh and last predictor, denoted x_{7}, is the existence of Lyme disease. In conducting a statistical study, we would like to answer the following questions: do these variables really impact the Alzheimer’s disease risk? Is there a relationship between the variables? If so, what is this relationship? Can the values of these parameters be adjusted in order to efficiently predict the Alzheimer’s disease risk?
Throughout this paper, the steps undertaken to estimate the unknown parameters \(\beta_{i \in [0,7]}\) are explained in detail. The next section explains the sampling stage.
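The model equation is not reproduced in this text. Under the assumption of the standard multiple linear regression form with the seven predictors and the parameters \(\beta_{i \in [0,7]}\) named above, the model and the associated least squares problem can be written as follows. The error term \(\varepsilon\), the design matrix \(X\) (with a leading column of ones) and the normal equations are standard notation assumed here, not taken from the source.

```latex
% Assumed standard form of the multiple regression model
y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{7} x_{7} + \varepsilon

% Least squares estimate and its normal equations
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_{2}^{2},
\qquad
X^{T} X \hat{\beta} = X^{T} y
```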
The Alzheimer’s disease prediction sampling stage
The sampling stage is a fundamental stage that has a great impact on the accuracy of the prediction model. Indeed, a small sample or a sample with similar individuals could lead to an inaccurate model [42]. Thus, sampling efficiently means predicting efficiently. To tackle the problem of small samples, a number of statistical methods are known to predict efficiently from small samples, such as that of Hurvich and Tsai [47]. In addition, the central limit theorem can be applied when the population is large. This theorem states that the sampling distribution of the sample mean can be approximated by a normal probability distribution when the sample is large. In practice, the sampling distribution can be approximated by a normal distribution when the sample size is greater than or equal to 30 [42].
The function notsimilar takes a patient as input and returns a Boolean as output. The function compares each attribute of the input to the attributes of the sample; if there is any similarity, the function returns false. Otherwise it returns true.
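The listing for notsimilar is not reproduced in this text. The following is a minimal sketch of how such a filter might look, under the assumption that two patients count as similar when all of their attributes are equal; the source may use a stricter criterion. Passing the sample explicitly, the helper build_sample and the patient records are hypothetical choices made for illustration.

```python
def notsimilar(patient, sample):
    """Return True if `patient` differs from every patient already sampled.

    Assumption: two patients are "similar" when all attributes are equal.
    """
    for other in sample:
        if all(patient[key] == other.get(key) for key in patient):
            return False
    return True


def build_sample(patients, size):
    """Collect up to `size` patients, skipping those similar to earlier picks."""
    sample = []
    for patient in patients:
        if len(sample) >= size:
            break
        if notsimilar(patient, sample):
            sample.append(patient)
    return sample


# Hypothetical patient records, for illustration only.
patients = [
    {"age": 70, "area": "urban"},
    {"age": 70, "area": "urban"},  # duplicate of the first record
    {"age": 65, "area": "rural"},
]
sample = build_sample(patients, size=2)
```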
The Adjoint method for the least squares estimation problem
This code was tested against large-scale data to discover its limits. Unfortunately, the method suffers from shortcomings when the patient sample is large and when the number of predictors is very large. To overcome those shortcomings, a new computational approach is presented below. This approach is massively parallel, so as to absorb the heavy computation and to increase performance.
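The original listing is not reproduced in this text. As a rough, sequential sketch of the idea, the normal equations can be solved by inverting X^T X through its adjugate (adjoint) matrix, using inverse(A) = adj(A) / det(A). This is an assumption about what the Adjoint method does here, not the authors' code; the function names and the data are hypothetical. The determinant uses plain cofactor expansion, whose cost grows very quickly with the number of predictors, which is consistent with the limitation noted above.

```python
def minor(a, i, j):
    """Matrix `a` with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if not a:
        return 1  # determinant of the empty (0 x 0) matrix
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def adjugate(a):
    """Transpose of the cofactor matrix: adj[i][j] is the cofactor C_ji."""
    n = len(a)
    return [[(-1) ** (i + j) * det(minor(a, j, i)) for j in range(n)]
            for i in range(n)]


def least_squares_adjoint(x_rows, y):
    """Solve (X^T X) beta = X^T y with inverse(X^T X) = adj / det."""
    design = [[1.0] + list(row) for row in x_rows]  # add intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    d = det(xtx)
    if d == 0:
        raise ValueError("X^T X is singular")
    adj = adjugate(xtx)
    return [sum(adj[i][j] * xty[j] for j in range(p)) / d for i in range(p)]


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2, for illustration only.
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
beta = least_squares_adjoint(x_rows, y)
```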
MRAM: MapReduce with Adjoint method
MapReduce is a programming model for data processing [48]. It enables distributed algorithms to run in parallel on clusters of machines with varied features. MapReduce also handles the parallel computation issues, so users can focus their efforts on the programming model. Since its advent, MapReduce has gained popularity in both the scientific community and industry due to its effectiveness in parallel processing [49]. Indeed, the parallelization of the QR factorization and SVD matrix decomposition methods is a relevant example of the scientific community’s interest in MapReduce. Benson et al. [50] reported matrix decomposition methods implemented in the MapReduce programming model. As pointed out earlier, QR factorization is the most common method used to solve the least squares estimation problem. To the best of our knowledge, the Adjoint method has not yet been implemented on a MapReduce framework. Thus, this paper details an implementation of the Adjoint method on MapReduce with the aim of solving the least squares estimation problem.
Working within MapReduce requires redesigning traditional algorithms. The computation is expressed as two phases: map and reduce. Each phase has key-value pairs as input and output. Two functions must also be specified: the map function and the reduce function. The types of the key-value pairs may be chosen by the programmer.
A MapReduce-based Adjoint method (MRAM) is proposed in this paper to make the conventional Adjoint method work effectively in a distributed environment. Our method has two steps, which are described in detail in the following part.
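To make the two phases concrete, here is a small in-memory sketch of the map, group-by-key and reduce pattern. It is not Hadoop code and does not use any Hadoop API; the driver run_mapreduce, the two user functions and the patient records are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_age_band(patient):
    """Map: emit (age band, risk) as a key-value pair."""
    band = (patient["age"] // 10) * 10
    yield band, patient["risk"]


def reduce_mean(key, values):
    """Reduce: average all risk values that share the same key."""
    return sum(values) / len(values)


# Hypothetical records, for illustration only.
records = [
    {"age": 62, "risk": 10.0},
    {"age": 68, "risk": 20.0},
    {"age": 75, "risk": 30.0},
]
mean_risk_by_band = run_mapreduce(records, map_age_band, reduce_mean)
```

In a real cluster the grouping step is performed by the framework's shuffle phase, and map and reduce calls run on different machines.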
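The detailed description of the two steps is not reproduced in this text. The sketch below shows one plausible decomposition, assumed here purely for illustration and not claimed to be the authors' MRAM: a first step in which mappers emit partial products of X^T X and X^T y for their block of rows and a reducer sums them, and a second step in which each cofactor of X^T X is computed as an independent task before the estimate is assembled. All function names and data are hypothetical, and everything runs in a single process.

```python
from collections import defaultdict


def minor(a, i, j):
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    if not a:
        return 1  # determinant of the empty matrix
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def map_block(block):
    """Step 1 map: emit partial entries of X^T X and X^T y for one row block."""
    for row, target in block:
        r = [1.0] + list(row)  # intercept column
        for i, ri in enumerate(r):
            yield ("xty", i), ri * target
            for j, rj in enumerate(r):
                yield ("xtx", i, j), ri * rj


def map_cofactor(xtx, i, j):
    """Step 2 map: one independent task per cofactor, keyed by adjugate index."""
    return (j, i), (-1) ** (i + j) * det(minor(xtx, i, j))


def mram_sketch(blocks):
    groups = defaultdict(list)
    for block in blocks:  # each block could be handled by a different mapper
        for key, value in map_block(block):
            groups[key].append(value)
    sums = {key: sum(values) for key, values in groups.items()}  # step 1 reduce
    p = 1 + max(key[1] for key in sums if key[0] == "xty")
    xtx = [[sums[("xtx", i, j)] for j in range(p)] for i in range(p)]
    xty = [sums[("xty", i)] for i in range(p)]
    adj = dict(map_cofactor(xtx, i, j) for i in range(p) for j in range(p))
    d = det(xtx)
    if d == 0:
        raise ValueError("X^T X is singular")
    # Step 2 reduce: beta = adj(X^T X) X^T y / det(X^T X)
    return [sum(adj[(i, j)] * xty[j] for j in range(p)) / d for i in range(p)]


# Hypothetical row blocks from y = 1 + 2*x1 + 3*x2, for illustration only.
blocks = [
    [((0, 0), 1.0), ((1, 0), 3.0)],
    [((0, 1), 4.0), ((1, 1), 6.0), ((2, 1), 8.0)],
]
beta = mram_sketch(blocks)
```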
Evaluation and experimental results
In this section, we evaluate the accuracy and the performance of the proposed model on simulated data based on actual data from the Riskalz dataset and from previous studies. To validate the resulting model and to evaluate its strength, the proposed solution involves additional steps that are detailed below.
Prediction accuracy measures
Fisher’s test, Student’s test and correlation coefficient
Fisher’s F-test, also called the global significance test, is used to determine whether there is a significant relationship between the dependent variable and the set of independent variables. Student’s t-test, called the individual significance test, is used to determine whether each of the independent variables is significant. A Student’s test is performed for each independent variable of the model.
A correlation test is performed between the independent variables of the model. If the correlation coefficient between two variables is greater than 0.70, it is not possible to determine the effect of a particular independent variable on the dependent variable.
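The correlation check described above can be sketched as follows. The Pearson coefficient is computed for every pair of predictors, and pairs above the 0.70 threshold are flagged. Using the absolute value, so that strong negative correlations are flagged too, is an assumption made here; the function names and the predictor columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.70):
    """Return (name_a, name_b, r) for predictor pairs with |r| above threshold."""
    names = list(columns)
    flagged = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            r = pearson(columns[names[a]], columns[names[b]])
            if abs(r) > threshold:
                flagged.append((names[a], names[b], r))
    return flagged


# Hypothetical predictor columns, for illustration only.
columns = {
    "age": [50, 60, 70, 80],
    "work_hours": [8, 7, 6, 5],
    "exercise_hours": [1, 3, 2, 2],
}
flagged = correlated_pairs(columns)
```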
A Fisher’s test, based on Fisher’s distribution, can be used to test whether a relationship is significant. With a single independent variable, the Fisher’s test leads to the same conclusion as the Student’s test. On the other hand, with more than one independent variable, only the F-test can be used to test the overall significance of a relationship.
The logic underlying the use of the Fisher’s test to determine whether the relationship is statistically significant is based on the construction of two independent estimates of σ^{2}.
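In the standard textbook formulation, assumed here, the two estimates are the mean square due to regression, MSR = SSR / p, and the mean square error, MSE = SSE / (n − p − 1), where p is the number of independent variables and n the sample size; the test statistic is F = MSR / MSE. A minimal sketch follows, with a hypothetical function name and made-up values; the predicted values are assumed to come from a least squares fit.

```python
def f_statistic(actual, predicted, num_predictors):
    """Return F = MSR / MSE for fitted values from a least squares model."""
    n = len(actual)
    dof_error = n - num_predictors - 1
    if dof_error <= 0:
        raise ValueError("not enough observations for this many predictors")
    mean_y = sum(actual) / n
    # Sum of squares due to regression and sum of squared errors.
    ssr = sum((yh - mean_y) ** 2 for yh in predicted)
    sse = sum((y - yh) ** 2 for y, yh in zip(actual, predicted))
    msr = ssr / num_predictors
    mse = sse / dof_error
    if mse == 0:
        return float("inf")  # perfect fit
    return msr / mse


# Hypothetical values: least squares fit of y on x = 1, 2, 3, 4.
actual = [1.0, 3.0, 2.0, 4.0]
predicted = [1.3, 2.1, 2.9, 3.7]
f_value = f_statistic(actual, predicted, num_predictors=1)
```

The resulting value is then compared with the critical value of the F distribution with p and n − p − 1 degrees of freedom, taken from a statistical table or a statistics library.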
Experiments
In this section, we test the proposed method on three datasets to confirm its robustness. For each case study, a brief description is given. At the end of this section, we carry out experiments and compare the actual and predicted values for each case study.
a. Student performance case study
The dataset was collected using school reports and questionnaires. The collected data concern student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features.
Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
In the current study, we apply our approach to predict the G3 attribute on the basis of the remaining ones. An exhaustive list of attributes and their descriptions can be found at http://archive.ics.uci.edu/ml/datasets/Student+Performance.
b. Parkinsons telemonitoring case study
The dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a 6-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients’ homes.
The main aim of the data is to predict the motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’) from the 16 voice measures. For more details, readers can refer to https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring.
c. The Levenson self report psychopathy scale value case study

Participant = Identification number assigned to participant

Eye tracker = Method of eye tracking (1 = head mounted; 2 = tower)

Primary = Primary subscale of the Levenson Self Report Psychopathy Scale

Secondary = Secondary subscale of the Levenson Self Report Psychopathy Scale

Emotion: ANG = Angry expression, DIS = Disgust expression, FEAR = Fear expression, HAP = Happy expression, SAD = Sad expression, SUR = Surprise expression

Intensity: 5 = 55%, 9 = 90%

Sex: F = Female, M = Male

Region: Eyes = Eyes, Mouth = Mouth
Thus, ANG 5 F refers to an angry expression at 55% intensity, expressed by a female face and ANG 5 F Eyes refers to the eye region of the same face.
d. The comparative study
Discussion
In this section, we conduct a comparative study with the aim of positioning the proposed method among the methods for solving the least squares problem. To this end, we use the Hadoop job performance model given by Khan et al. [52] to estimate the job completion time.
The compared algorithms (Cholesky, Indirect TSQR, Direct TSQR, Householder QR and MRAM) are the parallel MapReduce-based methods for solving the least squares problem.
Table 1 The estimated \(\beta_{r}\) and \(\beta_{w}\) values for different HDFS file sizes
| HDFS size (GB) | Write (MB/s) | Read (MB/s) | \(\beta_{w}\) (s/MB) | \(\beta_{r}\) (s/MB) |
| --- | --- | --- | --- | --- |
| 1 | 67.72 | 60.25 | 0.015 | 0.017 |
| 32 | 61.39 | 85.91 | 0.016 | 0.012 |
| 64 | 81.22 | 83.91 | 0.012 | 0.012 |
| 128 | 79.56 | 76.15 | 0.013 | 0.013 |
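The \(\beta_{w}\) and \(\beta_{r}\) columns appear consistent with taking the reciprocal of the measured write and read throughput, rounded to three decimals. The short sketch below checks this reading against the tabulated values; the relation is inferred from the numbers rather than stated in the text, and the function name is hypothetical.

```python
def inverse_bandwidth(throughput_mb_per_s):
    """Seconds needed per megabyte at the given throughput (MB/s)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1.0 / throughput_mb_per_s


# HDFS size (GB) -> (write MB/s, read MB/s), copied from the table above.
measurements = {
    1: (67.72, 60.25),
    32: (61.39, 85.91),
    64: (81.22, 83.91),
    128: (79.56, 76.15),
}

# HDFS size (GB) -> (beta_w, beta_r) in s/MB, rounded as in the table.
betas = {
    size: (round(inverse_bandwidth(write), 3), round(inverse_bandwidth(read), 3))
    for size, (write, read) in measurements.items()
}
```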
Table 2 Number of reads and writes at each step (in bytes)
| Quantity | Cholesky (bytes) | Indirect TSQR (bytes) | Direct TSQR (bytes) | Householder QR (bytes) | MRAM (bytes) |
| --- | --- | --- | --- | --- | --- |
| \(R_{1}^{m}\) | 8mn + Km | 8mn + Km | 8mn + Km | 8mn + Km | Kmn + (m − 1)(n − 1) |
| \(W_{1}^{m}\) | 8m_{1}n^{2} + 8m_{1}n | 8m_{1}n^{2} + 8m_{1}n | \(8mn + 8m_{1} n^{2} + Km + 64m_{1}\) | 8mn + Km | Kmn + 8mn |
| \(R_{1}^{r}\) | 8m_{1}n^{2} + 8m_{1}n | 8m_{1}n^{2} + 8m_{1}n | 0 | 0 | Kmn + 8mn |
| \(W_{1}^{r}\) | 8n^{2} + 8n | 8r_{1}n^{2} + 8r_{1}n | 0 | 0 | 8k + 8k |
| \(R_{2}^{m}\) | 8n^{2} + 8n | 8r_{1}n^{2} + 8r_{1}n | 8m_{1}n^{2} + Km_{1} | 8mn + Km | – |
| \(W_{2}^{m}\) | 8n^{2} + 8n | 8r_{1}n^{2} + 8r_{1}n | 8m_{1}n^{2} + Km_{1} | 16m_{1} | – |
| \(R_{2}^{r}\) | 8n^{2} + 8n | 8r_{1}n^{2} + 8r_{1}n | 8m_{1}n^{2} + Km_{1} | 0 | – |
| \(W_{2}^{r}\) | 8n^{2} + 8n | 8n^{2} + 8n | \(8m_{1} n^{2} + 32m_{1} + 8n^{2} + 8n\) | 0 | – |
| \(R_{3}^{m}\) | \(8mn + Km + m_{3} \left( {8n^{2} + 8n} \right)\) | \(8mn + Km + m_{3} \left( {8n^{2} + 8n} \right)\) | \(8mn + Km + m_{3} \left( {8m_{1} n^{2} + 64m_{1} }\right)\) | – | – |
| \(W_{3}^{m}\) | 8mn + Km | 8mn + Km | 8mn + Km | – | – |
| \(R_{3}^{r}\) | 0 | 0 | 0 | – | – |
| \(W_{3}^{r}\) | 0 | 0 | 0 | – | – |
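The formula that turns these byte counts into a time bound is not reproduced in this text. One simple I/O-only form, assumed here for illustration and not necessarily the model of Khan et al. [52], charges each step the time to read and write its data at the rates \(\beta_{r}\) and \(\beta_{w}\), divided by the number of map or reduce tasks running in parallel. The function names, the degrees of parallelism and the byte counts below are hypothetical.

```python
BYTES_PER_MB = 1e6  # assumption: decimal megabytes


def step_time(read_m, write_m, read_r, write_r, beta_r, beta_w, p_map, p_reduce):
    """I/O time in seconds of one MapReduce step; byte counts as in the table."""
    map_io = (read_m * beta_r + write_m * beta_w) / BYTES_PER_MB
    reduce_io = (read_r * beta_r + write_r * beta_w) / BYTES_PER_MB
    return map_io / p_map + reduce_io / p_reduce


def lower_bound(steps, beta_r, beta_w, p_map, p_reduce):
    """Sum of per-step I/O times; `steps` holds (R^m, W^m, R^r, W^r) tuples."""
    return sum(
        step_time(rm, wm, rr, wr, beta_r, beta_w, p_map, p_reduce)
        for rm, wm, rr, wr in steps
    )


# Hypothetical single step: mappers read 100 MB, reducers write 50 MB.
steps = [(100e6, 0.0, 0.0, 50e6)]
t_lb = lower_bound(steps, beta_r=0.012, beta_w=0.016, p_map=10, p_reduce=5)
```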
Table 3 The computed lower bounds T_{lb} in seconds
| HDFS size (GB) | Cholesky (s) | Indirect TSQR (s) | Direct TSQR (s) | Householder QR (s) | MRAM (s) |
| --- | --- | --- | --- | --- | --- |
| 32 | 802 | 802 | 1232 | 8224 | 625 |
| 64 | 536 | 536 | 618 | 10,055 | 467 |
| 128 | 366 | 366 | 10,475 | 30,994 | 450 |
Tables 2 and 3 confirm that the performance of the proposed solution is competitive with existing methods in terms of number of operations and computational time.
Conclusion
In this paper, we carried out a comparative study between our proposal and the parallel methods for solving the least squares estimation problem. The results support the use of the proposed method, as they confirm its efficiency and speed. Moreover, we presented a detailed description of the parallel MapReduce-based Adjoint method. The application of the method to predicting Alzheimer’s disease risk confirms its robustness.
