Using Big Data-machine learning models for diabetes prediction and flight delays analytics

Large volumes of data are now generated daily at a high rate. Data from health systems, social networks, finance, government, marketing and bank transactions, as well as from sensors and smart devices, keep increasing, so the tools and models used to process them have to be optimized. In this paper we apply and compare three machine learning algorithms (Linear Regression, Naïve Bayes and Decision Tree) to predict diabetes, and we perform analytics on flight delays. The main contribution of this paper is an overview of Big Data tools and machine learning models, together with some metrics that help in choosing a more accurate model. For diabetes, we compared the performance of the three models to see which one gives the best results. For flight delays, we analyzed flight datasets to support decision making and delay prediction, and produced a dashboard that can help managers of airline companies obtain a 360° view of their flights and take strategic decisions. The experiment shows that Linear Regression, Naïve Bayes and Decision Tree give the same accuracy (0.766), but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). For the flight delays analytics, the model can show, for example, the airport that recorded the most flight delays. Several tools and machine learning models for big data analytics are discussed in this paper. We conclude that, for the same dataset, the model used for prediction has to be chosen carefully. In future work, we will test different models in other fields (climate, banking, insurance).

must be considered: technological compatibility, deployment complexity, cost, efficiency, performance, reliability, security risk [1,12]. Data scientists are facing several challenges when dealing with high volumes of data. This includes data capture, data storage, searching, sharing, analysis and visualization.
Hadoop consists of two main components: HDFS for data storage and Hadoop MapReduce for data processing [1,15]. The main advantage of Hadoop is its capacity to process large data sets rapidly, thanks to its parallel clusters and its distributed file system. The second advantage of Hadoop is fault tolerance: data is replicated across several nodes.

HDFS
HDFS is designed to store large amounts of data across multiple nodes of commodity hardware. HDFS has a master-slave architecture made up of data nodes, each of which stores blocks of the data, retrieves data on demand, and reports back to the name node with its inventory. The name node keeps records of this inventory (references to file locations and metadata) and directs traffic to the data nodes upon client requests. If the name node fails, a secondary name node will write backups of metadata to multiple file systems [15,16]. Figure 1 shows the replication of a file across different data nodes.
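The block replication idea can be illustrated with a minimal sketch. It assumes a block size of 128 MB and a replication factor of 3, and it uses a simple round-robin placement; the function name `place_blocks` and the node names are hypothetical, and real HDFS applies a rack-aware placement policy that is not modelled here.

```python
def place_blocks(file_size_mb, data_nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct data nodes (simplified round-robin placement)."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block_id in range(n_blocks):
        # Pick `replication` consecutive nodes, wrapping around the node list.
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


# Hypothetical cluster of four data nodes storing a 300 MB file.
nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(file_size_mb=300, data_nodes=nodes)
for block_id, replicas in layout.items():
    print(f"block {block_id}: {replicas}")
```

Because every block lives on three different nodes in this sketch, losing any single data node still leaves two copies of each block available.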

MapReduce
A MapReduce job consists of two parts: a map phase, which takes raw data and organizes it into key/value pairs, and a reduce phase, which processes those pairs in parallel [16,17]. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element. The input is divided into ranges by the Input Format, and a map task is created for each range. The Job Tracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer [13,18].
Figure 2 shows an example of a MapReduce job.
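To make the two phases concrete, the following is a minimal single-process sketch of a word-count job. It only imitates the map, shuffle and reduce steps in plain Python; the function names and the sample lines are illustrative assumptions, and no Hadoop API is involved.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: combine all values that share the same key."""
    return key, sum(values)


# Illustrative input: each string plays the role of one input record.
lines = ["big data tools", "big data models", "machine learning models"]
intermediate = [pair for line in lines for pair in mapper(line)]
word_counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the mapper calls run on different worker nodes and each reducer receives only the keys of its own partition; here everything runs sequentially for readability.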

Limitations of Hadoop
In Hadoop version 1, the MapReduce framework consists of a single master Job Tracker and one slave Task Tracker per cluster node. The Job Tracker coordinates all jobs running on the cluster and assigns map and reduce tasks to the Task Trackers, while the Task Trackers run the assigned tasks and periodically report their progress to the Job Tracker [1].

Fig. 1 Block replication in Hadoop cluster
The Job Tracker is responsible for both data processing and resource management (maintaining the list of live nodes, maintaining the list of available and occupied map and reduce slots, and allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy). Large Hadoop clusters revealed a scalability bottleneck [18] (https://data-flair.training/blogs/13-limitations-of-hadoop/) caused by having a single Job Tracker:
- If the Job Tracker fails, all jobs are lost.
- According to Yahoo, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.
- A node cannot run more map tasks than it has map slots at any given moment, even if no reduce tasks are running. This harms cluster utilization: when all map slots are taken and more are needed, the free reduce slots cannot be used, and vice versa.
- Hadoop was designed to run MapReduce jobs only; it cannot run other kinds of applications, such as graph processing.
In the second version of Hadoop, which introduced YARN [8], the two major functions of the Job Tracker have been split into separate daemons: a global Resource Manager and a per-application Application Master [1].

Apache Spark
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process [11]. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Spark has the following features:
• Speed: Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible because it reduces the number of read/write operations to disk by storing the intermediate processing data in memory [19].
• Supports multiple languages: Spark provides built-in Application Programming Interfaces (APIs) in Java, Scala and Python, so applications can be written in different languages.
• Advanced analytics: Spark supports not only map and reduce but also Structured Query Language (SQL) queries, streaming data, machine learning and graph algorithms.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program on the master node. The driver program connects to one or more worker nodes through the cluster manager [9]. Spark can be divided into the following components [20]:
- Spark SQL: a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
- Spark Streaming: performs streaming analytics.
- MLlib (Machine Learning Library) [21]: a distributed machine learning framework on top of Spark.
- GraphX: a distributed graph-processing framework on top of Spark.
Spark is often used with distributed data stores such as MapR-XD, Hadoop's HDFS and Amazon S3, with popular Not only SQL (NoSQL) databases such as MapR-DB, Apache HBase, Apache Cassandra and MongoDB, and with distributed messaging stores such as MapR-ES and Apache Kafka [22]. Apache HBase is a scalable and distributed NoSQL database that sits atop HDFS. It was designed to store structured data in tables that can have billions of rows and millions of columns. HBase is not a relational database and was not designed to support transactional and other real-time applications [18].
Apache HBase enables random, real-time read/write access to big data. It can accommodate billions of rows and millions of columns on clusters of commodity servers. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS [18].

Use cases of machine learning
Machine learning is used in many domains like marketing, health, e-commerce, finance, opinion analysis [22,23].
Classification: Gmail uses a machine learning technique called classification to decide whether an email is spam or not, based on the data of the email: the sender, recipients, subject and message body. Classification takes a set of data with known labels and learns how to label new records based on that information.
Clustering: Google News uses clustering to group news articles into different categories, based on title and content. Clustering algorithms discover groupings that occur in collections of data.
Collaborative filtering: Amazon uses a machine learning technique called collaborative filtering (commonly referred to as recommendation) to determine which products users will like, based on their history and similarity to other users.

Related works
In recent years, many studies have focused on Big Data analytics and machine learning. Machine learning models can help in predicting disease. The authors of [24] designed a prediction algorithm using machine learning and searched for the optimal classifier, i.e. the one giving the closest result compared to clinical outcomes. Their results show that the Decision Tree and Random Forest algorithms have the highest specificity, 98.20% and 98.00% respectively, and are best suited for the analysis of diabetic data. The authors of [25] give a detailed account of predictive models from the basics to the state of the art, describing the various types of predictive models, the steps to develop a predictive model, and their applications in health care in general and in diabetes in particular. Farooq and Hussain [26] developed a hybrid clinical decision support mechanism by combining evidence extrapolated from legacy patient data to facilitate cardiovascular preventative care.
Sternberg et al. [27] propose a taxonomy and summarize the initiatives used to address the flight delay prediction problem, according to scope, data and computational methods. Chen and Li [28] propose a delay propagation model as a link to connect features in order to build a chained delay prediction model. Zettam et al. [29] described a MapReduce-based Adjoint method for preventing brain disease. In [30], mobile phone data were collected and the customers' gender and age were predicted. The authors analyzed Call Data Records (CDR), billing data and other customer information, and applied different types of machine learning algorithms to provide marketing campaigns with more accurate information about customer demographic attributes (age, gender). The model, applied to the information of 18,000 users, achieved 85.6% accuracy for gender prediction and 65.5% for age prediction. Dahdouh [31] developed a distributed course recommender system for an e-learning platform that aims to discover relationships between students' activities (historical logs) using the association rules method, in order to help students choose the most appropriate learning materials. Their experimental results show the effectiveness and scalability of the proposed system.
Al-Saqqa, Al-Naymat and Awajan [4] used Apache Spark with MLlib to perform sentiment analysis on customers' reviews of a product. Their experimental results showed that the Support Vector Machine classifier outperforms the Naïve Bayes and logistic regression classifiers [21]. In electronic commerce, customers can make electronic payments; however, such a system depends on confidence, which makes fraud detection a critical factor. For this purpose, authors have presented a Scalable Real-time Fraud Finder (SCARFF).

Diabetes prediction
Disease prevention has been one of the uses of new technology, in particular of machine learning algorithms [32,33]. Medical devices with sensors and health clouds continuously generate a huge amount of data, often called streaming big data. Over the last few decades, heart diseases and diabetes have been the most common causes of death worldwide, so early detection of these diseases and continuous monitoring can reduce the mortality rate [32].
Diabetes is a chronic disease, or group of metabolic diseases, in which a person suffers from a prolonged elevated level of blood glucose, either because insulin production is inadequate or because the body's cells do not respond properly to insulin [24]. There are four types of diabetes: type 1, type 2, gestational diabetes and prediabetes [25].
We used machine learning algorithms to predict diabetes. The models were tested using Python as the programming language. Python has diverse applications in software development, such as gaming, web frameworks and applications, language development, prototyping and graphic design applications, which gives it a wider range of uses than other programming languages used in the industry. Three machine learning algorithms were applied to the diabetes dataset: Linear Regression, Naïve Bayes and Decision Tree [26]. The dataset used contains 7 features, and we want to predict the class of a given person (1: positive, 0: negative). We calculated the accuracy, the score and the RMSE of each model. Figure 3 shows the flowchart of the model.

Proposed algorithm
Data cleaning refers to the process of removing invalid data points from a dataset. Many statistical analyses try to find a pattern in a data series, based on a hypothesis or assumption about the nature of the data. Data cleaning is one of the important parts of machine learning and plays a significant part in building a model, although it is one of those things that everyone does but no one really talks about. In this process we ignore the invalid data points and conduct our analysis on the remaining data. The algorithm used starts as follows:
1. Data loading and cleaning
1.1 Import the library pandas.
Before applying any machine learning algorithm, we first analyzed the dataset. This step is necessary to become familiar with the data, to gain some understanding of the potential features and to see whether data cleaning is needed. Data cleaning is considered one of the crucial steps of the workflow, because it can make or break the model. Dataset cleaning includes correcting duplicate or irrelevant observations, missing or null data points, etc.
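A minimal sketch of such a cleaning step is given below. It is a plain-Python stand-in using only the standard `csv` module rather than pandas, so that it runs without extra dependencies; the column names and sample values are made up for illustration and are not taken from the paper's dataset.

```python
import csv
import io

# Tiny made-up sample standing in for the real CSV file (illustrative values).
RAW = """glucose,bmi,age,class
148,33.6,50,1
85,26.6,31,0
85,26.6,31,0
,28.1,21,0
183,23.3,32,1
"""


def clean_rows(text):
    """Drop rows with missing fields and exact duplicate rows."""
    reader = csv.DictReader(io.StringIO(text))
    seen, cleaned = set(), []
    for row in reader:
        values = tuple(row.values())
        if any(v is None or v.strip() == "" for v in values):
            continue  # missing or null data point
        if values in seen:
            continue  # duplicate observation
        seen.add(values)
        cleaned.append({k: float(v) for k, v in row.items()})
    return cleaned


rows = clean_rows(RAW)
print(len(rows), "rows kept")
```

With pandas, the same two filters would typically be expressed with `DataFrame.dropna()` and `DataFrame.drop_duplicates()`.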
Feature engineering is the process of transforming the gathered data into features that better represent the problem we are trying to solve, in order to improve the model's performance and accuracy. After the data was cleaned, the next phase was model selection. Model selection, or algorithm selection, is the heart of machine learning: it is the phase where we select the model that performs best on the dataset. It is general practice to avoid training and testing on the same data, because the goal of the model is to predict out-of-sample data, and an overly complex model can lead to overfitting. The model evaluation is done using the following metrics [34]:
• Accuracy: The percentage of correct classifications (values from 0 to 100). It indicates the classifier's ability to correctly guess the proper class for each element. Eq. 1 gives the accuracy of a model:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} \tag{1} \]
The accuracy is obtained by calculating the following parameters:
  - True Positives (TP): The cases in which we predicted YES and the actual output was also YES.
  - True Negatives (TN): The cases in which we predicted NO and the actual output was NO.
  - False Positives (FP): The cases in which we predicted YES and the actual output was NO.
  - False Negatives (FN): The cases in which we predicted NO and the actual output was YES.
From (1), the accuracy can be written in terms of these counts:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \]
• F-score: The weighted average of precision and recall of classifications (values from 0 to 1). The F1 score is the harmonic mean of precision and recall. It tells how many instances the model classifies correctly, as well as how robust it is. High precision with lower recall gives an extremely accurate classifier that nevertheless misses a large number of instances that are difficult to classify. The greater the F1 score, the better the performance of the model. Eq. 3 gives the expression of the F-score, where \(\text{Precision} = TP / (TP + FP)\) and \(\text{Recall} = TP / (TP + FN)\):
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
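As an illustration, accuracy, precision, recall and F1 can be computed directly from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the label vectors are invented for the example and are not the paper's data.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels: 8 persons, actual class versus predicted class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The guards against zero denominators are a design choice of this sketch: a model that never predicts the positive class gets precision 0 instead of raising an error.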

• Mean Absolute Error (MAE): The average of the absolute differences between the original values and the predicted values. It measures how far the predictions were from the actual output. However, it gives no idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. Mathematically, it is represented as:
\[ \text{MAE} = \frac{1}{N} \sum_{j=1}^{N} \lvert y_j - \hat{y}_j \rvert \]
where \(y_j\) is the original value, \(\hat{y}_j\) the predicted value and \(N\) the number of predictions.
• RMSE: The standard deviation of the differences between the real and predicted values in regression; it is the square root of the Mean Squared Error (MSE). MSE is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the square of the difference between the original values and the predicted values. The advantage of MSE is that its gradient is easier to compute, whereas Mean Absolute Error requires complicated linear programming tools to compute the gradient. Because we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, so the model can focus more on the larger errors.
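The difference between the two error measures can be seen on a small example. The sketch below implements MAE, MSE and RMSE with their usual definitions; the sample values are invented for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Invented values: the last prediction is off by 3, the others by at most 2.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 4.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

On these values the RMSE is larger than the MAE because squaring gives the single error of 3 more weight than the smaller errors.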

Flight delays analytics
Flight delays create problems in scheduling, passenger inconvenience and economic losses, so there is growing interest in predicting flight delays beforehand in order to optimize operations and improve customer satisfaction. With the rapid growth of air traffic, increasing flight delays in the United States (US) have become a serious and prominent problem. According to the Bureau of Transportation Statistics (BTS), nearly one in four airline flights arrived at its destination over 15 min late (https://www.sita.aero/air-transport-it-review/articles/using-artificial-intelligence-to-predict-flight-delays, https://data-flair.training/blogs/13-limitations-of-hadoop/). It is reported that the annual total cost of air transportation delays was over $30 billion, which poses a significant challenge to the development of the Next Generation Air Transportation System [28]. Delay is one of the most remembered performance indicators of any transportation system [27].
A flight delay is said to occur when an aircraft lands or takes off later than its scheduled arrival or departure time, respectively. Conventionally, if a flight's departure or arrival time is more than 15 min later than its scheduled departure or arrival time, there is considered to be a departure or arrival delay at the corresponding airport. Notable reasons for delays of commercially scheduled flights are adverse weather conditions, air traffic congestion, late arrival of the aircraft from its previous flight, and maintenance and security issues [35].
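The 15-minute convention can be written as a small predicate. This is only an illustrative sketch: the function name `is_delayed` and the sample times are assumptions, and the comparison is strict, following the "more than 15 min" wording.

```python
from datetime import datetime, timedelta

# Conventional threshold above which a flight counts as delayed.
DELAY_THRESHOLD = timedelta(minutes=15)


def is_delayed(scheduled, actual, threshold=DELAY_THRESHOLD):
    """Return True when the actual time is more than `threshold` after schedule."""
    return actual - scheduled > threshold


# Hypothetical departure scheduled at 14:30.
scheduled = datetime(2018, 1, 5, 14, 30)
ten_late = is_delayed(scheduled, datetime(2018, 1, 5, 14, 40))
thirty_late = is_delayed(scheduled, datetime(2018, 1, 5, 15, 0))
print(ten_late, thirty_late)
```

The same function applies to departures and arrivals, since only the scheduled and actual times differ.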
Managers would like to know which day of the week has the most delayed flights, the number of delayed flights for each departure and destination airport, as well as other statistics that can help decision making. The objective of this study is to analyze historical flight data to gain valuable insights. Using machine learning models, we can build a predictive model to predict whether a flight will be delayed or not, given a set of flight characteristics. The analysis should be able to answer the following questions: Which airports have the most delays? Which routes are typically the most delayed? What are the origin airport delays per month and per day/hour? What are the primary causes of flight delays? In our model we used Spark for storage and data processing, and Scala and Spark SQL for our analytics. Figure 4 shows the workflow of the model.
As we saw in the previous section, to get reliable information from data, we have to perform many steps, from data acquisition to the analysis of results.
Spark is an in-memory tool for fast data processing. We used flight information from the United States Department of Transportation. After downloading the flight data, the first step was to load it into a DataFrame. We used a Scala case class and StructType to define the schema, corresponding to a line in the JSON (JavaScript Object Notation) data file. We loaded the data for 2 months (January and February 2018). We performed a column transformation: we added a new column "orig_dest" for the origin-destination airport pair, in order to use it as a feature. We could then query the DataFrame to get useful information such as the count of departure delays by origin-destination pair and the day of the week with the most delayed flights.
Nibareke and Laassiri J Big Data (2020) 7:78
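The column transformation and the grouped counts described above can be sketched in plain Python. This is a stand-in for the Scala and Spark SQL code, not a reproduction of it: the records, the field names (`origin`, `dest`, `day_of_week`, `dep_delay_min`) and the 15-minute threshold are all assumptions made for illustration.

```python
from collections import Counter

# Made-up flight records; field names are hypothetical, not the dataset schema.
flights = [
    {"origin": "ORD", "dest": "SFO", "day_of_week": 1, "dep_delay_min": 45.0},
    {"origin": "ORD", "dest": "SFO", "day_of_week": 7, "dep_delay_min": 20.0},
    {"origin": "DEN", "dest": "SFO", "day_of_week": 1, "dep_delay_min": 0.0},
    {"origin": "DEN", "dest": "SFO", "day_of_week": 3, "dep_delay_min": 32.0},
    {"origin": "ATL", "dest": "MIA", "day_of_week": 3, "dep_delay_min": 5.0},
]
DELAY_MINUTES = 15  # assumed threshold for counting a departure as delayed


def add_orig_dest(record):
    """Return a copy of the record with an added 'orig_dest' feature."""
    return {**record, "orig_dest": f"{record['origin']}_{record['dest']}"}


enriched = [add_orig_dest(f) for f in flights]
delayed = [f for f in enriched if f["dep_delay_min"] > DELAY_MINUTES]

# Equivalent in spirit to a GROUP BY ... COUNT(*) query on the DataFrame.
delays_by_route = Counter(f["orig_dest"] for f in delayed)
delays_by_day = Counter(f["day_of_week"] for f in delayed)
print(delays_by_route.most_common())
print(delays_by_day.most_common())
```

On a cluster, the same logic would be expressed as DataFrame transformations so that Spark can distribute the filtering and grouping across worker nodes.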

Experiments
For our experiment we used a single-node cluster. The setup includes a 64-bit Windows 10 operating system with an i5 Central Processing Unit (CPU) with 8 cores clocked at 3 GHz, 4 GB of Random Access Memory (RAM) and a total storage space of 350 GB. To speed up the computation, a hybrid platform with many nodes, or a cloud, can be used for storing and processing data. We used Anaconda as the platform, which can be installed on a single-node cluster. For the performance evaluation, we compared three algorithms for predicting whether a person has diabetes. We installed Anaconda version 3, available at https://repo.anaconda.com/archive, and used Python 3+ as the programming language. We used Spyder, a Python development environment with rich interactive testing and debugging features. We imported several libraries to run our algorithm, such as scikit-learn, NumPy, pandas, Matplotlib and Seaborn.
Diabetes is a disease that occurs when the blood glucose level becomes high, which ultimately leads to other health problems such as heart disease, kidney disease, etc. The dataset was obtained from https://datahub.io/machine-learning/diabetes#resource-diabetes. The diabetes dataset was produced by the National Institute of Diabetes and Digestive and Kidney Diseases and provided by Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707. The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e. if the 2 h post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). All patients here are females. For male patients, other parameters should be considered. The data used for our experiment is a Comma Separated Value (csv) file of 36 KB (diabetes.csv). The file contains 768 records with 7 features, and each record is labeled with its class (positive: 1, negative: 0), as shown in Table 1.
For the flight delays analytics, the dataset was obtained from https://github.com/mapr-demos/mapr-spark2-ebook. The input file is a JSON file of 8.41 MB with 413,348 rows and 12 features (Table 2).
In our experiment, we used Scala as the programming language and the Databricks cloud platform, which provides a notebook and a Spark session for executing the Scala code. The cloud platform is available at https://community.cloud.databricks.com/login.html and can be accessed by creating a user account identified by an email address. The first step is to load the JSON file into the data directory of the cloud platform; we then create a cluster, which runs the Scala jobs, and start it.

Results and discussion
Figure 5 shows the distribution of the diabetes dataset for each feature. Almost 250 persons in our sample were between 20 and 30 years old, and more than 250 persons had an insulin level between 0 and 250 U/ml. The diabetes pedigree function ranged between 0 and 2, with the majority of persons having a value between 0 and 0.3. The performance evaluation of the three models (Decision Tree, Naïve Bayes and Linear Regression) was done by calculating three parameters: accuracy, score and RMSE. A good model has a score close to 1 and an RMSE close to 0. These parameters determine the accuracy of the model and a good prediction with minimum errors. Table 3 shows the accuracy, score and RMSE of the three models. For the Decision Tree model, the RMSE was close to 0 while the score tended to 1. This can occur when the expected class outputs are exactly matched by the actual outputs. It means that the model is ideally trained and the goal is reached at 100%, i.e. the features were strong enough to result in a 100% classification rate. These values mean that the model was accurate in prediction.
The three models show the same accuracy (0.766); Linear Regression and Naïve Bayes give errors (RMSE) that are nearly equal (Fig. 6).
The experiment shows that the three models give the same accuracy, but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). Given a set of feature values, the model is able to predict whether a person is positive or negative. For example, testing our model with [6, 109, 60, 27, 0, 25, 0.206, 27] gives 0 as the class, which means the person is predicted to be negative.
Fig. 6 Performance of the models
While analyzing the flight delays, the results showed that Sunday and Monday are the days of the week with the highest average departure delay, while Wednesday had the lowest average (Fig. 7). The ORD airport has the highest number of delayed departures (Fig. 8). The routes ORD->SFO and DEN->SFO have the highest delays, possibly because of weather in January and February (Fig. 9). Figure 10 shows that the number of delayed flights was 4558 and the number of flights that were not delayed was 36,790. We can calculate the ratio = delayed flights / not delayed flights = 0.12; this ratio reflects the quality of service. If the ratio is greater than 1, which means that the number of delayed flights is greater than the number of flights that were not delayed, managers will have to take strategic decisions to improve the service. These analytics gave a good dashboard for the managers.
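The ratio can be checked with a few lines of arithmetic using the two counts reported for Fig. 10. The share of delayed flights among all flights is an additional quantity computed here only for comparison; it is not reported in the text, and the variable names are illustrative.

```python
# Counts reported for Fig. 10.
delayed = 4558
not_delayed = 36790

# Ratio used in the text: delayed flights over flights that were not delayed.
ratio = delayed / not_delayed

# Share of delayed flights among all flights (added here for comparison only).
share = delayed / (delayed + not_delayed)

needs_action = ratio > 1  # decision rule described in the text
print(f"delayed / not delayed = {ratio:.2f}")
print(f"share of all flights delayed = {share:.1%}")
print("strategic action required:", needs_action)
```

A ratio is unbounded above, whereas the share always lies between 0 and 1, which can make the share easier to read on a dashboard.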

Conclusion
Machine learning algorithms can be used to extract valuable insights from data. This can help in decision making and in making predictions in many domains (healthcare, finance, marketing, recommendation engines, etc.). Machine learning algorithms are commonly classified into supervised, unsupervised, semi-supervised, reinforcement and transfer learning. These algorithms can be used to perform classification, clustering and collaborative filtering on a dataset. The quality of each algorithm is assessed through parameters such as accuracy, score and error. Diabetes is defined as a chronic disease in which a person presents a prolonged elevated level of blood glucose (the production of insulin is inadequate or the body's cells do not respond to insulin). Diabetes can cause serious complications for people's health, and predicting it can help in taking strategic decisions to protect one's health. In this paper, we gave an overview of Big Data and its tools, of machine learning algorithms and of their performance comparison. We also experimented with a use case of Big Data analytics using Spark by analyzing flight delays.
To make a good prediction or classification, we have to use an adequate machine learning algorithm. In our experiment we tried to predict diabetes using Naïve Bayes, Linear Regression and Decision Tree. We calculated and compared three parameters, accuracy, error and score, for the three models. With Decision Tree, a score equal to 1 and an error equal to 0 were achieved. We concluded that Decision Tree is the most appropriate model for this use case, while Naïve Bayes gave the worst values of score, accuracy and error (RMSE).
We analyzed flight data to help decision making. The model used can show on which day of the week, and at which departure and destination airports, the most flights are delayed. From this information, managers can adopt strategies to anticipate and minimize the impact of flight delays on customer satisfaction.
In future work we will continue to explore other domains such as business, marketing, social networks and climate, in order to find out how big data can be analyzed to extract value. We will investigate how different tools and models can be combined to achieve the highest performance in terms of time, correctness and resource optimization.