Flight delay prediction based on deep learning and Levenberg-Marquart algorithm

Flight delay is inevitable and plays an important role in both the profits and losses of airlines. An accurate estimate of flight delay is critical for airlines because it can be used to increase customer satisfaction and the income of airline agencies. Many studies have modeled and predicted flight delays, most of them by extracting important characteristics and the most relevant features. However, most of the proposed methods are not accurate enough because of the massive volume of data, the dependencies among variables and the very large number of parameters. This paper proposes a model for predicting flight delay based on Deep Learning (DL). DL is one of the newest methods for solving problems with a high level of complexity and a massive amount of data. Moreover, DL can automatically extract the important features from data. Furthermore, because most flight delay data are noisy, a technique based on the stack denoising autoencoder is designed and added to the proposed model. The Levenberg-Marquart algorithm is also applied to find proper weight and bias values, and finally the output is optimized to produce highly accurate results. To study the effect of the stack denoising autoencoder and the LM algorithm on the model structure, two other structures are also designed. The first is based on an autoencoder and the LM algorithm (SAE-LM), and the second is based on a denoising autoencoder only (SDA). To investigate the three models, we apply them to a U.S. flight dataset, which is imbalanced. To create a balanced dataset, an undersampling method is used. We measured the precision, accuracy, sensitivity, recall and F-measure of the three models in both cases. The accuracy of the proposed prediction model is analyzed and compared with a previous prediction method.
The results of the three models on both the imbalanced and balanced datasets show that the precision, accuracy, sensitivity, recall and F-measure of the SDA-LM model are better than those of the SAE-LM and SDA models. The results also show that the accuracy of the proposed model in forecasting flight delay, on both the imbalanced and balanced datasets, is greater than that of the previous RNN model.


Introduction
Because air travel plays a significant role in the economy of agencies and airports, they need to increase the quality of their services. One of the important challenges that airports and airline agencies face today is flight delay. In addition, flight delays worry passengers and cause extra expenses for the agency and the airport itself. In 2007, the U.S. government endured losses of 31-40 billion dollars due to flight delays [1]. In 2017, 76% of flights arrived on time; compared with 2016, the percentage of on-time flights decreased by 8.5% [2]. The reasons for flight delays include security, weather conditions, shortage of parts, technical and aircraft equipment issues, and flight crew delays [3][4][5]. Flight delay is inevitable [6] and has strongly negative economic effects on passengers, agencies and airports [7][8][9][10][11]. Furthermore, delay can damage the environment by increasing fuel consumption and the emission of pollutant gases [1,12-16]. In addition, delay affects trade, because the transport of goods depends heavily on customer trust, which can increase or decrease ticket sales; on-time flights build customer confidence [17,18]. Flight delay prediction can therefore support skillful decisions and operations by agencies and airports, and a good passenger information system can go some way toward satisfying customers [19].
Given the abundance and diversity of the reasons for flight delays, we face a massive amount of data that cannot be processed with earlier data analysis methods [17] such as classification [1] or decision trees [8]. Machine learning based methods [1,2,17,20,21] are also not well suited to this volume of data, because the features of older intelligent systems were designed by humans and were usually tailored to a particular case; moreover, people rarely perceive some features and usually neglect them. In addition, in older learning processes the level of difficulty rises as the number of categories available for classification increases [8], and extracting the important and effective features becomes practically impossible. Because of the complexity of the parameters and their effect on each other, the problem of flight delay prediction is considered NP-Complete [22]. Furthermore, the problem inherently involves oscillation and is considered non-linear [23]. Finally, the data used include noise and errors that must be handled to cope with the problem [24,25].
There have been many studies in this area. For example, an older regression method [26] has been used to compute delay propagation. In this model, the destination delay depends strongly on arriving flights, and the influential factors include day, time, airport capacity and some factors related to passenger loads. Because the model neglects weather conditions, it is inefficient for the U.S. but suitable for Europe: only 1-4% of European flights are delayed due to weather conditions, whereas this value for the U.S. is between 70 and 75%. In [27], an intelligent neural network was designed that estimates the destination delay for practical use in controlling traffic flow. This model uses airport type, airplane type, date, time, flight path and flight frequency as factors for network training, and non-linear and linear methods for data analysis. Because neural network parameters are difficult to interpret, understanding how the factors behave and verifying which factors matter most for a flight is extremely difficult. Furthermore, older intelligent algorithms usually use shallow learning models to handle big data in complicated classifications. However, the results of such analysis are far from the ideal condition. Although a model design may turn out well or badly, the outcome depends heavily on experience and even happenstance, and this procedure requires too much time. Therefore, traditional simulation and modelling techniques are neither suitable nor efficient for such problems. There is an ongoing line of research that addresses this problem, and this paper also uses it in modelling.
One of the newest methods for solving such large and complicated problems with bulky data, and one that has attracted many scientists, is deep neural networks [21,24,25,28]. Deep learning, whose design is inspired by the learning of the human neural network, is a branch of machine learning and a collection of algorithms that try to model high-level abstractions by learning at different layers and levels. This enables deep learning to process a bulky volume of data in complicated data classification [29]. Moreover, this structure is well suited to feature extraction, so that learning can extract the maximum possible number of features [29]. The layered network structure and the ability to compute at any data scale have led to the growing application of these techniques. These networks are of different types, including Convolutional Neural Networks [30], Autoencoders [31], Restricted Boltzmann Machines [32] and sparse coding based methods [33], each of which is applied to a specific kind of problem.
One recently presented work on this problem employs a recurrent deep neural network, and its results show high accuracy in flight delay prediction [24]. However, this model suffers from overfitting, which the researchers addressed with the typical dropout technique at each step of the repeated training procedure. Moreover, applying this method decreases the computation time and memory space needed during training.
The next drawback is the noise in the input data, which the researchers neglected during prediction. This paper presents a model based on deep learning that considers the factors affecting delay. Moreover, because the data are noisy [24,25], a stack denoising autoencoder (SDA) is used in designing the model. Afterwards, the structure of the flight delay forecasting model is optimized with the Levenberg-Marquart (LM) algorithm. By developing this deep learning-based model, the paper aims to increase the accuracy of flight delay predictions.
The rest of the paper is organized as follows. We review previous work related to our topic in the "Literature review" section. A complete description of the research process and the overall structure of the designed model are presented in the third section. The fourth section evaluates the results obtained and compares them with previous methods. The fifth section presents a conclusion and an overall view of the study.
One of the best statistics-based studies [56] aimed to reduce delay time. The study investigated the important factors that arise before a flight and on the ground. In the next step, it predicted the delay at the destination based on factors that occur close to the arrival time at the destination. The results showed that whenever the delay is correctly predicted, passenger dissatisfaction and fuel consumption decrease and, consequently, the number of flights increases. Moreover, agencies' profits can be increased by reducing the number of passengers who select the wrong routes, by specifying the probabilities for some flights, and by optimizing delay time prediction.
Another prominent probability-based investigation [57] argues that huge storms in the U.S. strongly affect flight delay. The study predicts delay through mathematical calculations that consider the delay durations of flights affected by a storm on the same day. Meteorological reports showed that a storm affects the region from one hour before to one hour after the event, causing short-lived weather changes. In the next step, Monte-Carlo simulation was used to estimate airport runway capacity, so that the traffic of each runway could be estimated. Because the research employed only one factor, the model is not accurate enough, but it makes it possible to increase the air capacity of the region's path structure [57].
One of the best network-based models was presented in [82]. The researchers proposed a model based on a Bayesian approach and the Gaussian mixture model-expectation maximization (GMM-EM) algorithm to predict and analyze the factors affecting flight delay in Brazil at several points along the path. In the first stage of the model, the degree of influence of each factor is specified, and then it is investigated whether or not the delay occurred over a wider domain. Next, the delay probability is computed using the GMM-EM [82] and EM algorithms, which are specified based on similarity. The results showed that the probability of delay at higher levels can be predicted by specifying low-level factors. Moreover, the similarity function of GMM-EM [82] takes larger values than that of the EM algorithm [82] at each step, so the results converge sooner. In addition, the model accuracy is increased, so the prediction is more trustworthy.
One of the best studies in the area of operational methods [93] examined the effects of capacity and damage on different levels of delay in American airports.
Other simulations focus on stability and reliability during the delay and its propagation. For instance, in [90] the problems of congestion were studied. Then, a queue-based model was presented for analyzing delay propagation in consecutive flights in the Los Angeles airport.
One of the best studies in the area of machine learning [119] presented a model that applies machine learning techniques to investigate delay in arriving flights. The research first extracted the important characteristics and then used them, with arbitrary samples, to train both neural networks and a deep belief network. The model uses Memento [119] and Resilient Back Optimized Propagation [119]; resilient back propagation is quicker than back propagation [119], and as a result model training is faster. The deep belief network [119] is built from a few Boltzmann machines [119], in which each connected layer receives data from the previous layer, and at each step a Boltzmann machine [119] is added to the belief network. Overall, training time was reduced by adjusting the parameters, the learning rate and the misclassification error rate. Because each layer converges at the output, training speed decreases and the gradient approaches zero. In addition, a relatively small database was used for the model because of limited system capacity. This leads to a noticeable reduction in prediction precision for cases that are not in the database.
A machine learning model was presented in [125]. The researchers proposed a model based on the support vector regressor (SVR) algorithm to predict flight delay at U.S. airports. Because of the large amount of data, the data were grouped and sampled by month. In the first stage, CatBoost handled the categorical variables with its ordered boosting method. Because CatBoost itself scores features, it could select the parameters that were more important to the model when the threshold was unknown, so CatBoost was used to evaluate the importance of each feature, and finally 15 features were selected to build a training model.
Several common regression prediction algorithms were then used at the same time to predict the delay for the round-trip flight between John F. Kennedy International Airport and O'Hare International Airport.
Finally, the specific delay time was predicted. The results showed that SVR gave the best prediction of flight delay time, with a best accuracy of 80.44%. The time characteristics also had a large impact on the model performance.
Air time and flight distance also had a greater impact on the on-time performance of a specific flight, while different carriers and specific aircraft had a slight influence on on-time performance. The accuracy of this model is low because detailed weather and aircraft data could not be collected.
Another study [126] analyzed flight information for U.S. domestic flights operated by American Airlines, covering the top 5 busiest U.S. airports, and predicted possible arrival delay using data mining and machine learning approaches. Because the data were imbalanced, an over-sampling technique, Randomized SMOTE, was applied for data balancing. A Gradient Boosting Classifier model was trained on the flight data, and Grid Search was then used to tune its hyper-parameters, achieving a maximum accuracy of 85.73%. The results showed that deleting some features reduced the accuracy.
A group of researchers [127] designed 5 models to predict flight delay based on machine learning models: Logistic Regression, Decision Tree Regression, Bayesian Ridge, Random Forest Regression and Gradient Boosting Regression. They collected data from the U.S. Bureau of Transportation Statistics on all domestic flights in 2015 and predicted whether the arrival of a particular flight would be delayed.
The metrics used to evaluate the models were Mean Squared Error (MSE), Mean Absolute Error (MAE), Explained Variance Score, Median Absolute Error and R2 Score. Because imbalanced datasets were used, the calculated error was high. Based on the results, the Random Forest Regressor was the best model for predicting arrival and departure delay.
One of the newest studies in the area of machine learning applies supervised learning methods to aggregate flight departure delays at airports in China [128]. The expected departure delay at airports was selected as the prediction target, and four popular supervised learning methods (multiple linear regression, support vector machine, extremely randomized trees and LightGBM) were investigated to improve the predictability and accuracy of the model. Of special note, the models with local weather characteristics did not perform as well as those without meteorological data.
They measured accuracy, MSE and MAE to evaluate the 4 methods, and the results showed that the LightGBM model provided the best result, with an accuracy of 0.86.
A group of researchers [129] designed a framework that integrates multiple data sources to predict the departure delay of a scheduled flight, and discussed the details of the data pipeline. They were the first group to take advantage of the airport situational awareness map, defined as airport traffic complexity (ATC), and they combined the proposed ATC factors with weather conditions and flight information.
In the first stage, historical data, weather condition data, and GPS data for tarmac aircraft and vehicles were collected from different data sources. In the feature extraction stage, principal component analysis was applied to the weather data, ATC features were extracted from the tarmac aircraft and vehicle trajectory data, and the historical scheduling table data were also used. Beyond the extracted features, it seems that more potentially useful features could be explored from the airport situational awareness map. Then, in the modelling stage, multiple datasets were combined and various data combinations were used to train a regressor model for predicting departure delay time.
The authors selected four popular regressors from different families (linear regression, SVR, ANN, and regression trees) to show the robustness of their approach to different regressors. Finally, the prediction results were evaluated using Root Mean Square Error (RMSE) to measure the performance of flight delay time prediction with different models and different combinations of data sources. Extensive experiments on a large real-world dataset showed that the LightGBM regressor outperforms other conventional regressors.
Although other work done in recent years is outside the scope of this article, it is still related to the topic in a way that contributes to this article. We therefore include a study [130] that employed a support vector machine (SVM) model to explore the non-linear relationship between flight delay outcomes, and another [131] that explored a broader spectrum of factors that could potentially affect flight delay and proposed gradient boosting decision tree (GBDT) based models for generalized flight delay prediction.
The techniques presented above face limitations, because they cannot cope with massive data volumes and complicated computations. For example, in some of these studies the model is designed around the specifications and conditions of a particular country [43-45, 73, 75, 100, 104]. Others consider weather conditions in their prediction [38,132], and a further group considers special situations such as en-route [5,82] or destination [61,88] delay.

Deep neural networks
Deep neural networks are composed of several hidden layers, each of which has an important role in learning the model [133,134]. The actual learning process is performed repeatedly through these layers [133,134]. It can therefore be inferred that deep learning techniques differ from older methods in the learning part, in the lack of a limit on the amount of data, and in finding the best solution for NP-Complete problems [22]. Deep learning is employed in different areas including speech recognition [135][136][137][138][139], machine vision [30,140-142], language processing [143][144][145], recommender systems [146], urban traffic forecasting [147] and air traffic [70,71,78,96]. Raising the number of variables in forecasting, modeling and simulation can result in a more precise final model, which is achieved using deep neural networks. The remaining part of the section reviews previous studies in flight delay forecasting.
One of the newest studies [24,25] solves problems with massive data volume. This research designed a high-precision model for forecasting U.S. flight delays using a recurrent deep neural network. The research aimed first to compute the daily delay for each airport and then to estimate the delay of a specific flight using the results of the first step. The recurrent deep neural network stores the information of each hidden layer, which increases model performance. Although the model has high precision, its high complexity led to increased depth and eventually to overfitting, which was solved using dropout techniques. Moreover, employing this technique can reduce the computation time and memory space during training. The next challenge is the extremely noisy input data, which the authors neglected during forecasting even though it strongly affects the forecast.
Some researchers [148] designed a forecasting model based on Bayesian networks and long short-term memory (LSTM) that uses discretized variables such as weather, crowding and airport parameters to compute the daily delay for some airports in the U.S.
This model is composed of three memory layers in the network and uses the previous four days to compute the average delay at the destination. Moreover, uncertainty properties are extracted through Monte Carlo Dropout techniques. Although the research reached a stable balance between complexity and overfitting using variable dimension reduction, it cannot forecast some unique events that strongly affect delay.
Some researchers [132] investigated weather conditions and their effect on origin-destination delay, using one of the Weather Underground (WU) protocols for variables such as wind, temperature and morning dew. In addition, tools including Apache Spark, the Analysis service and Elastic tools were used to analyze the data; Apache Spark is an engine for parallel computation with libraries for machine learning. Statistical findings showed that 89% of delays were due to wind. In the next step, they specified the correspondence between variables through dependency decision trees and association rules, and then computed the probability of delay occurrence using linear regression. Moreover, using association rules, they showed that the next factor in delay is humidity. Because their research investigated the weather conditions of only some airports, only 10% of the flights were postponed. Moreover, their research did not investigate the weather conditions during the flight. Therefore, the model can be used only for some specified airports in specific states.
A group of researchers [149] designed 2 models to forecast delay, one based on long short-term memory (LSTM) [148] and the other based on Random Forest. To create the dataset, a ground station continuously received automatic dependent surveillance broadcast (ADS-B) messages and uploaded them to a central cloud server. The weather information of airports, scheduled flight time, departure airport and destination airport were then collected and integrated. A random forest classification architecture was constructed, in which the ensemble classifier uses the most voted result of the N sub-classifiers as its prediction. The ability of each sub-classifier and the independence of the sub-classifiers jointly improved the model accuracy. Experimental results showed that the Random Forest based method obtained good performance, with a best accuracy of 90.2%, and that the LSTM-based architecture obtained relatively higher training accuracy but overfitted on the limited dataset.
One of the newest studies [150] addresses problems with high-dimensional data and considers the relationship of the data with space and time. This research designed a high-accuracy model for forecasting U.S. flight delays using a stacked autoencoder. The stacked autoencoder was adopted to train the networks, and all network parameters were optimized with the back propagation method. The model revealed how flight delay evolves under space-time variation and proved superior when compared with a traditional neural network. Results from many experiments indicated that the prediction accuracy with deep stacked autoencoders was above 90%.
In one of the best deep learning based studies, a framework was designed with three parts: command executive, data structure, and utilities [151]. The command executive provides the communication channel between the user and the functions. Information such as the flight plan and airport parameters is passed through the data structures as the inputs of the functions. The utilities contain common operations and tools that support the command executive and the data structures. The platform supports the FAA's Collaborative Decision-Making (CDM) process with the intent of reducing flight delays in the NAS; it is based on deep learning algorithms and uses LSTM to predict arrival and departure delays accurately from time series data. The system first integrates various databases into the framework of NextGen's SWIM program and then predicts flight delays. Finally, the study presented assessments of the risks and sustainability of the proposed platform. Based on the results, the authors demonstrated that this platform can save billions of dollars and millions of hours, but the framework is not available for everyone to use.
Some researchers [152] combined a deep belief network with a support vector machine to create a prediction model (DBN-SVR), in which the DBN extracted the main factors with tangible impacts on flight delays, reduced the dimension of the inputs, and eliminated redundant information. The output of the DBN was then used as the input of the SVR model to capture the key influential factors (leading to flight delays) and generate the predicted delay values. They employed a grid-search method to identify the key parameters in SVR and select the optimal parameter values. After training the DBN-SVR model with proper parameter tuning, they tried to detect and characterize the key influential factors using the observed DBN. The prediction performance was described by MAE and RMSE. The MSE was finally employed to measure the importance of input factors and detect the key influential factors. The results showed that air traffic control was one of the key influential factors. There was also a strong relationship between the average delay of current and previous flights during 16:00-22:59, so delays in afternoon and evening flights have a higher probability of propagating and affecting subsequent flights.
Another study [153] was carried out using a quantitative research method. The author focused mainly on predicting airline flight delays by analyzing flight data, especially for domestic airlines operating within the United States of America. The main aim of the study was to reduce the data dimension before feeding the data to the deep learning network. The primary dataset was first filtered from more than 100 features to one third of that number. According to this study, the dataset needs to be divided into training and test sets before the deep learning model is implemented. The data were split randomly, with the training set containing 80% and the test set 20% of the whole data. The training set was used to train the deep learning model, and the test set was used to check the accuracy using confusion matrix performance measures.
The author used mainstream classification machine learning and deep neural networks to classify whether a flight would be delayed or not. A Decision Tree was used as the machine learning algorithm, and a Deep Artificial Neural Network (DANN) was used as the deep neural network. The accuracy of the DANN was slightly higher than that of the Decision Tree; however, even a tiny difference in accuracy was considered tremendously valuable because the dataset was enormous and the number of flights per day is large.
Based on the results of this study, the accuracy did not change with the reduced number of features, and the best accuracy was 82.10%. Several experiments were therefore carried out with the same setup but different numbers of neurons and hidden layers. Surprisingly, there were no clear differences in accuracy. When the number of hidden layers was increased, the accuracy was 81.80%. It can thus be concluded that increasing the number of hidden layers did not ensure higher accuracy.
The structure recommended in [24,25], one of the recent studies in this area, still has some problems such as overfitting or memory space shortage. Moreover, data noise is neglected. These problems affect the forecasting precision of the model.

Methodology
In this section, we present our technique, with which we tried to solve the problems related to massive data, processing complications [21,24,25,28], lack of computational space, overfitting and noise in the data [24,25]. Figure 1 illustrates the development of the proposed model. As can be seen from the figure, the proposed technique contains three phases. The most important notation is described in Table 1.

First phase: data collecting and pre-processing
At the beginning of this phase, the model inputs must be determined so that the model can learn from them and arrive at its final structure. The dataset used for evaluating the model was obtained from historical data containing flight schedule data for 5 years. The variables used as inputs are shown in Table 2. The model is applied to real-world data collected from airports in the U.S. and is compared with existing flight delay predictors. After the data are collected, the characteristics enter the system as a vector X that contains all variables in the form $X = \{X_1, X_2, \ldots, X_{19}\}$, where each $X_i$ represents a single characteristic. Because the ranges of these characteristics vary widely and do not match each other, the database must be pre-processed. We therefore considered normalization techniques and chose 'min-max' normalization. This technique is widely known as Feature Scaling, and Eq. 1 is used to normalize each variable.
$$X_i' = \frac{X_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$
In (1), $X_i$ represents each variable, min(x) is the lowest value in the series, which is mapped to 0, and max(x) is the highest value in the series, which is mapped to 1. Of all the variables, those related to time and flight information are normalized based on Eq. 1, and Fig. 2 shows the min-max algorithm. Delay is calculated from the time difference recorded in the DepTime and DepDelay fields at the origin and the ArrTime and ArrDelay fields at the destination. If the flight delay is more than fifteen minutes, the DepDel15 and ArrDel15 fields of that flight are set to 1; otherwise they are set to 0. Flights are delayed for various reasons, which the database divides into five general categories: CarrierDelay, WeatherDelay, NASDelay, SecurityDelay and LateAircraft. A value of 1 in these fields identifies the cause or causes of the flight delay. The WeatherDelay field relates to weather conditions.
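The pre-processing described above can be illustrated with a minimal sketch. It assumes NumPy; the function names `min_max_normalize` and `delay_flag` and the sample delay values are hypothetical and are not taken from the paper. The handling of a constant column is also an assumption, since the text does not cover that case.

```python
import numpy as np

def min_max_normalize(column):
    """Scale a 1-D array to [0, 1] with min-max normalization (Eq. 1)."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # Assumption: a constant column carries no information; map it to zeros.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

def delay_flag(delay_minutes, threshold=15):
    """Return 1 where the delay is more than `threshold` minutes, else 0."""
    return (np.asarray(delay_minutes, dtype=float) > threshold).astype(int)

# Hypothetical departure delays in minutes for five flights.
dep_delay = np.array([0.0, 5.0, 15.0, 30.0, 60.0])
scaled = min_max_normalize(dep_delay)   # lowest value maps to 0, highest to 1
dep_del15 = delay_flag(dep_delay)       # analogous to the DepDel15 field
```

Because the rule is "more than fifteen minutes", a delay of exactly 15 minutes is flagged as 0 in this sketch.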

Second phase: pre-training model-building stack denoising auto encoder (SDA)
After the pre-processing phase, the second phase begins, in which the model is pre-trained; the training algorithm of a denoising autoencoder is summarized in Fig. 3. The normalized variables enter the first denoising autoencoder as inputs and are mapped to the first hidden layer in the form $X_i^{1} \rightarrow h^{(i+1)}$ [154]. Then, some of the characteristics of the input vector X are corrupted randomly at a rate of c. There are different methods for corrupting data; in this study, we use zero masking, meaning that we set the values of the selected variables to zero, which yields the corrupted vector $\tilde{X}$. The encoding phase then begins: the corrupted vector $\tilde{X}$ is encoded into the hidden code H, whose value is calculated based on Eq. 2 [154].
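Zero masking can be sketched in a few lines. This is an illustration only: the function name, the seed and the sample vector are assumptions, and `rate` plays the role of the corruption rate c mentioned above.

```python
import random


def zero_mask(x, rate, rng):
    """Return a copy of x in which each entry is set to 0 with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in x]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [0.2, 0.9, 0.4, 0.7, 0.1, 0.6]  # hypothetical normalized features
x_tilde = zero_mask(x, rate=0.3, rng=rng)  # corrupted copy fed to the encoder
```

The original vector `x` is left untouched, because the reconstruction is later compared against the clean input rather than the corrupted one.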
W represents the weight of a variable and b represents its bias. When an input enters a neuron, it is multiplied by a weight. In addition to the weight, another linear component applied to the input is the bias, which is added to the product of the weight and the input in order to shift the range of the resulting value. The bias is the last linear component that contributes to transforming the input. The initial weights and biases are assigned randomly and are updated during the training process. Once training begins, the neural network assigns larger weights to the inputs that it considers more important; a weight of 0 indicates an ineffective variable. After the encoding phase, the input vector is reconstructed from the hidden code H based on Eq. 3, resulting in the reconstructed vector $\hat{X}$. This phase is known as the decoding phase. The reconstructed vector $\hat{X}$ is transferred to the output [154]. $W^T$ denotes the transpose of the weight matrix W and $b_h$ denotes the bias associated with each hidden code. After the decoding process is completed, the reconstruction error [155] of $\hat{X}$ is calculated based on Eq. 4.
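Eqs. 2 to 4 are not reproduced in this text. As a reading aid, one common formulation that matches the description above is sketched below, where $\tilde{X}$ denotes the corrupted input and $\hat{X}$ the reconstruction. The activation functions $f$ and $g$ (often sigmoids) and the squared-error form of the reconstruction error are assumptions and may differ from the original equations.

```latex
\begin{aligned}
H &= f\left(W\tilde{X} + b\right) && \text{(encoding, cf. Eq. 2)} \\
\hat{X} &= g\left(W^{T} H + b_h\right) && \text{(decoding, cf. Eq. 3)} \\
L(X, \hat{X}) &= \lVert X - \hat{X} \rVert^{2} && \text{(reconstruction error, cf. Eq. 4)}
\end{aligned}
```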
One denoising autoencoder is formed in this phase. The cost function in Eq. 5 then estimates the error, that is, the difference between the real inputs and the reconstructed inputs. The cost function determines how precise each coded unit is. The goal here is to minimize the difference between the real input and the reconstructed input. The model parameters are randomly initialized and then optimized using the gradient descent algorithm. The best reconstructed vector is the one with the lowest cost. Because the network must reconstruct the original vector X from its corrupted version, it is forced to learn a smarter mapping than simply copying its input, so that even when the data contain a lot of noise, this method can extract useful characteristics and remove the noise during reconstruction.
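The forward pass and cost of a single denoising autoencoder can be sketched as follows. This is a minimal illustration under stated assumptions: sigmoid activations, tied weights (the decoder reuses W transposed) and a mean squared error cost are assumed, and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x_tilde, W, b):
    """Hidden code: sigmoid(W x_tilde + b); W has one row per hidden unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x_tilde)) + bj)
            for row, bj in zip(W, b)]


def decode(h, W, b_h):
    """Reconstruction: sigmoid(W^T h + b_h), reusing W (tied weights)."""
    n_inputs = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b_h[i])
            for i in range(n_inputs)]


def reconstruction_cost(x, x_hat):
    """Mean squared difference between the clean input and the reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x = [0.2, 0.9, 0.4]          # hypothetical clean, normalized input
x_tilde = [0.2, 0.0, 0.4]    # same input with one feature zero-masked
W = [[0.1, -0.3, 0.5], [0.4, 0.2, -0.1]]  # 2 hidden units, 3 inputs
b = [0.0, 0.0]
b_h = [0.0, 0.0, 0.0]

h = encode(x_tilde, W, b)
x_hat = decode(h, W, b_h)
cost = reconstruction_cost(x, x_hat)  # compared against the clean x, not x_tilde
```

Measuring the cost against the clean input is what makes the autoencoder a denoising one: gradient descent on this cost rewards reconstructions that undo the corruption.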
The cost function penalizes the network whenever it makes a mistake. Once the network is established, its prediction precision must increase while its error rate decreases. The optimal output is the one with the lowest cost. To increase the network's learning ability and decrease its error, the number of denoising autoencoders must be increased. Figure 4 shows the training algorithm of the stack denoising autoencoder.
In fact, each autoencoder represents a hidden layer containing several hidden units, in which encoding, decoding and the determination of weights and biases take place; finally, the X vector is the output of each hidden unit. When a denoising autoencoder is added to the network, the information from the previous hidden layer is passed to the new layer as its input, and the non-linear transformations between consecutive hidden layers allow the network to learn the structure of the data, after which the resulting network can predict flight delay. The cost function can therefore be used to compute the error between the real output and the predicted output. Finally, the network is evaluated on the training and testing sets; if prediction accuracy increases on both sets, another denoising autoencoder is added to the network. Otherwise, if training accuracy increases while test accuracy decreases, the network has fitted the noise in the training set and learned noise-related behavior. In that case, the addition of denoising autoencoders ends, and the stack denoising autoencoder structure [154], which is robust to noise, is finally formed.
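The stopping rule for stacking can be sketched as follows. The function name and the accuracy values are hypothetical; the sketch assumes that accuracies have already been measured for each candidate depth.

```python
def choose_depth(train_acc, test_acc):
    """Return the number of stacked autoencoders to keep.

    train_acc[k] and test_acc[k] are accuracies measured with k + 1
    autoencoders. Stacking continues while both accuracies improve and
    stops at the first addition that fails to improve both.
    """
    depth = 1
    for k in range(1, len(train_acc)):
        both_improve = (train_acc[k] > train_acc[k - 1]
                        and test_acc[k] > test_acc[k - 1])
        if not both_improve:
            break
        depth = k + 1
    return depth


# Hypothetical accuracies for 1, 2, 3 and 4 stacked autoencoders.
train = [0.80, 0.86, 0.90, 0.93]
test = [0.78, 0.84, 0.88, 0.85]
depth = choose_depth(train, test)
```

In this made-up example the fourth autoencoder raises training accuracy but lowers test accuracy, so the rule keeps three.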

Third phase: model optimization with Levenberg-Marquardt (LM) algorithm
The goal of the third phase is model optimization. Figure 5 shows the supervised fine-tuning algorithm of the stack denoising autoencoder. When a network is formed, weight and bias values are distributed randomly among the nodes. Once the output has been determined, the network error can be computed from it and returned to the network, together with the gradient of the cost function, to update the network's weights. The weights are updated so as to reduce similar errors in the future. This procedure is called back-propagation. In back-propagation, the network is traversed backwards: the errors and gradients are returned to the hidden layers so that the weights can be updated based on them.
The output of the last hidden layer is taken as the input to a supervised learning algorithm that fine-tunes all the parameters of this deep architecture with respect to the supervised criterion [156]. To fine-tune the parameters in this phase, we apply the LM algorithm [157] on top of the whole network to train on the input generated by the last autoencoder. The LM algorithm provides a numerical solution to the nonlinear problem of minimizing a function over the space of its parameters [158]; it is also stable and converges well [157,159]. The LM algorithm combines the benefits of the gradient descent and Gauss-Newton methods: it is built from a linear combination of the two based on adaptive rules [160]. The algorithm interpolates between gradient descent and Gauss-Newton, and in most cases it finds a solution even if it starts far from the final minimum. It is more robust than Gauss-Newton, but when the initial parameters are reasonable and the function is well behaved, it is slightly slower than Gauss-Newton. It is also one of the most popular curve-fitting algorithms, used mainly for least-squares problems [161]. The algorithm has two main phases [162]: computing the Jacobian matrix, which is the most complicated part of the algorithm, and computing the Hessian matrix and updating the weights, during which the network error is computed. According to the update rule, if the error goes down, that is, it is smaller than the previous error, the quadratic approximation of the total error function is working, and the combination coefficient μ can be decreased to reduce the influence of the gradient descent part (ready to speed up). On the other hand, if the error goes up, that is, it is larger than the previous error, it is necessary to follow the gradient more closely to find a proper curvature for the quadratic approximation, and the combination coefficient μ is increased.
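The update rule for μ can be illustrated on a small curve-fitting problem. This is a sketch under stated assumptions, not the authors' network training code: it fits a hypothetical two-parameter model y = a·exp(b·t), approximates the Hessian by JᵀJ, and scales μ by a factor of 10, which is a common but assumed choice.

```python
import math


def residuals(params, ts, ys):
    a, b = params
    return [a * math.exp(b * t) - y for t, y in zip(ts, ys)]


def jacobian(params, ts):
    a, b = params
    # Each row holds the partial derivatives of one residual w.r.t. (a, b).
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in ts]


def sse(r):
    return sum(v * v for v in r)


def lm_fit(params, ts, ys, mu=1e-2, iters=50):
    """Levenberg-Marquardt for a two-parameter model with a simple mu schedule."""
    params = list(params)
    err = sse(residuals(params, ts, ys))
    for _ in range(iters):
        r = residuals(params, ts, ys)
        J = jacobian(params, ts)
        # Approximate Hessian J^T J and gradient J^T r.
        h11 = sum(row[0] * row[0] for row in J)
        h12 = sum(row[0] * row[1] for row in J)
        h22 = sum(row[1] * row[1] for row in J)
        g1 = sum(row[0] * ri for row, ri in zip(J, r))
        g2 = sum(row[1] * ri for row, ri in zip(J, r))
        # Solve (J^T J + mu I) delta = -J^T r for the 2x2 case.
        a11, a22 = h11 + mu, h22 + mu
        det = a11 * a22 - h12 * h12
        d1 = (-g1 * a22 + g2 * h12) / det
        d2 = (-g2 * a11 + g1 * h12) / det
        trial = [params[0] + d1, params[1] + d2]
        trial_err = sse(residuals(trial, ts, ys))
        if trial_err < err:
            params, err = trial, trial_err
            mu /= 10.0   # error went down: lean towards Gauss-Newton
        else:
            mu *= 10.0   # error went up: lean towards gradient descent
    return params, err


# Hypothetical noise-free data generated from a = 2.0, b = 0.5.
ts = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * t) for t in ts]
(fit_a, fit_b), final_err = lm_fit([1.0, 0.0], ts, ys)
```

A large μ makes the step resemble a short gradient descent step, while a small μ makes it resemble a Gauss-Newton step, which is the interpolation described above.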
After training with the LM algorithm is finished, an activation function is chosen for the last layer so that the network can predict precisely. A logistic sigmoid function is used at the last layer because the final output should be a binary class, 0 or 1. After the optimized values for the weights and biases are determined, we expect the network's predictions to improve and the proposed model to get as close as possible to reality. After a delay is predicted for an airport, the cause of the delay and whether it occurred at the origin or the destination are determined.
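The role of the sigmoid output can be sketched as follows. The 0.5 decision threshold and the function names are assumptions; the text does not state how the sigmoid output is converted to a class.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_delay(z, threshold=0.5):
    """Map a last-layer pre-activation z to class 1 (delayed) or 0 (not delayed)."""
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical pre-activations of the output neuron for three flights.
predictions = [predict_delay(z) for z in (-2.0, 0.3, 4.1)]
```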

Results and discussions
The model is implemented in Python with TensorFlow and runs on a system with a 40-core CPU at a frequency of 2.6 GHz, 80 GB of RAM and a 250 GB hard disk. The flight information is an open dataset collected by the Bureau of Transportation Statistics of the United States Department of Transportation [163]; it includes the reason for each delay, whether the flight was cancelled or delayed, and the duration of each flight. Model training and testing use these data, which comprise 18 million records. The model uses 80% of the data for training and the remaining 20% for testing [164]. Finally, the model evaluation consists of two analyses, which are presented in the following sections.

First analysis
To evaluate the model, the number of denoising autoencoders and neurons must be determined based on precision, accuracy and time consumption. To do this, the model is first trained using one stack and 64 neurons, and the precision and accuracy values are calculated. Adding another denoising autoencoder increased precision and accuracy; therefore, another stack was added to the model's structure. On the other hand, each denoising autoencoder added to the structure also increases the processing time. The process of adding denoising autoencoders should therefore consider the trade-off between processing time and the number of denoising autoencoders. As a result, denoising autoencoders continue to be added as long as the gain in precision and accuracy from the previous structure to the newer one exceeds the threshold limit. Figure 6 shows the accuracy as a function of the number of denoising autoencoders and the computation time.
After the number of stacked denoising autoencoders is determined, the number of neurons is determined. As the number of neurons in each hidden layer is increased, precision and accuracy are evaluated for both the training and testing sets. When the number of neurons increases from 16 to 32 and from 32 to 64, precision and accuracy increase for both sets; however, when the number of neurons increases from 64 to 128, the precision and accuracy of the model increase on the training set while they decrease on the testing set. Therefore, the number of neurons was not increased further. The final structure consists of 3 stacked denoising autoencoders, 64 neurons and 4 hidden layers.

Second analysis
The data are classified into two classes, 0 and 1. Class 0 contains 15 million records of non-delayed flights and Class 1 contains 3 million records of delayed flights. Because the dataset is imbalanced, the model was trained on the imbalanced and balanced datasets separately, and the effect of each on the evaluation parameters was then assessed. Balancing the dataset requires a sampling method; undersampling and upsampling are two well-known sampling methods [165]. Undersampling reduces the Class 0 data to 3 million records, whereas upsampling increases the Class 1 data to 15 million records. Upsampling would require creating 12 million artificial records, which increases processing time, slows the process, causes model overfitting and ultimately lowers confidence in the model; therefore, the undersampling method is used. The performance of the proposed model is measured with the confusion matrix in Table 3 [166]. Each column of the table shows the predicted samples.
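Random undersampling can be sketched on a toy dataset with the same 5:1 class ratio as the 15 million and 3 million records described above. The function name, the seed and the toy data are assumptions; the text does not specify how the records to keep were selected.

```python
import random


def undersample(records, labels, rng):
    """Randomly drop majority-class records until both classes are the same size."""
    by_class = {0: [], 1: []}
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = [(rec, lab)
                for lab in (0, 1)
                for rec in rng.sample(by_class[lab], n)]
    rng.shuffle(balanced)  # avoid leaving the classes in two contiguous blocks
    return balanced


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = list(range(18))        # hypothetical record identifiers
labels = [0] * 15 + [1] * 3      # 15 non-delayed, 3 delayed
balanced = undersample(records, labels, rng)
```

All minority-class records are kept, and only the majority class loses records, so no artificial data are introduced.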
These four criteria in the confusion table capture the quality of algorithms that perform forecasting. Table 4 shows how evaluation measures such as precision, accuracy, sensitivity, recall and F-measure are computed [166].
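Table 4 is not reproduced in this text; the standard definitions of these measures in terms of the four confusion-table counts can be sketched as follows. The counts in the example are hypothetical, and the guard that returns 0.0 for an empty denominator is an assumed convention.

```python
def ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute the usual measures from true/false positive/negative counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # recall is also called sensitivity
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts: 30 delays caught, 10 false alarms, 20 delays missed.
m = classification_metrics(tp=30, fp=10, fn=20, tn=140)
```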
Moreover, each measure has micro, macro and weighted averages, which differ slightly. The macro average computes the measure for each class and then averages the results, weighting all classes equally; the micro average aggregates the contributions of all classes before computing the measure; and the weighted average averages the per-class values according to the amount of data in each class. Table 5 shows how the micro, macro and weighted averages are computed [167]. In addition, a delay is taken as the positive class (a correctly predicted delay is a true positive), and the proposed method is expected to have greater precision and accuracy than previous methods.
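The difference between the three averages can be illustrated for precision on a two-class problem. The counts are hypothetical, and the function name is an assumption; Table 5 itself is not reproduced in this text.

```python
def averaged_precision(per_class):
    """per_class maps a class label to (tp, fp, support) counts.

    `support` is the number of records that truly belong to the class.
    Returns the macro, micro and weighted averages of precision.
    """
    precisions = {c: tp / (tp + fp) for c, (tp, fp, _) in per_class.items()}
    total_support = sum(s for _, _, s in per_class.values())
    # Macro: every class counts equally, regardless of its size.
    macro = sum(precisions.values()) / len(precisions)
    # Micro: pool the counts of all classes, then compute the measure once.
    micro = (sum(tp for tp, _, _ in per_class.values())
             / sum(tp + fp for tp, fp, _ in per_class.values()))
    # Weighted: per-class values weighted by the amount of data in each class.
    weighted = sum(precisions[c] * s / total_support
                   for c, (_, _, s) in per_class.items())
    return macro, micro, weighted


# Hypothetical counts: class 0 has 150 records, class 1 has 50.
counts = {0: (140, 20, 150), 1: (30, 10, 50)}
macro, micro, weighted = averaged_precision(counts)
```

With an imbalanced dataset the macro average is pulled towards the minority class, which is why it can differ noticeably from the micro and weighted averages.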
To study the effect of the stack denoising autoencoder and the LM algorithm on the model structure, two other structures are also designed. The first structure is based on an autoencoder and the LM algorithm (SAE-LM), and the second is based on the denoising autoencoder only (SDA). In the first stage, the three models are trained on the imbalanced dataset; the results of the comparison are presented in Table 6.
Afterwards, to study the effect of the balanced dataset on the evaluation parameters, the three models are trained on the balanced dataset; the results of the comparison are presented in Table 7.
As shown in Table 7, the balanced dataset increases all evaluation values. Moreover, all the evaluation parameters of the proposed model are higher than those of the other models. Therefore, the stack denoising autoencoder has a positive effect on noisy data and increases the precision and accuracy of the structure. On the other hand, comparison with the SDA model shows that optimization with the LM algorithm is suitable for solving non-linear problems and achieving a stable model with a good degree of convergence. Figure 7 shows the evaluation parameters for the three structures: SDA-LM, SAE-LM and SDA. Moreover, the accuracy of the proposed model on the imbalanced dataset is 4.1% higher than the maximum accuracy of the previous model, which is based on RNN [24]. This value approaches 8% after balancing the dataset. In Fig. 8, the accuracy of the SDA-LM, SAE-LM and SDA structures is compared with that of the RNN structure [24,25]. As Fig. 8 shows, the accuracy of the proposed model is higher than that of the previous RNN-based model [24].
Finally, the accuracy of the proposed prediction model is compared with that of other previous prediction methods. As Table 8 shows, the accuracy of the proposed model is higher than that of the other methods. Lastly, to evaluate the validity of the proposed model and of the training results, we compute the standard deviation of all the parameters over 30 repetitions. The smaller the standard deviation, the closer the data are to the average and the less scattered, and therefore more reliable, the results are. Tables 9 and 10 show the standard deviation of the evaluation parameters for the three structures of the current study using the imbalanced and balanced datasets, respectively. As Table 9 shows, the standard deviation of all the evaluation parameters is small, and with the balanced dataset this value is reduced further. Therefore, the balanced dataset has a positive impact on the standard deviation and reduces it, as shown in Table 10.
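The dispersion check over repeated runs can be sketched with the standard library. The accuracies below are hypothetical and cover only five runs for brevity; whether the study used the sample or the population standard deviation is not stated, so the sample form is an assumption.

```python
import statistics


def summarize_runs(values):
    """Return the mean and the sample standard deviation of repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical accuracies from five repeated training runs.
runs = [0.92, 0.93, 0.91, 0.92, 0.92]
mean_acc, std_acc = summarize_runs(runs)
```

A small standard deviation relative to the mean indicates that the runs agree closely, which is the reliability argument made above.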

Conclusion
Predicting flight delays is an interesting research topic that has received much attention in recent years. Most studies have tried to develop and extend their models in order to increase the precision and accuracy of flight delay prediction. Since on-time flights are very important, flight delay prediction models must have high precision and accuracy. In this study, we proposed a novel optimized forecasting model based on deep learning that employs the LM algorithm. Two other structures were then created to study and validate the positive effect of the denoising autoencoder and the LM algorithm: one omits the denoising autoencoder and the other omits the LM algorithm. Moreover, our dataset is imbalanced and should be balanced. We used undersampling and upsampling techniques to balance the data. However, the results show that upsampling leads to overfitting; therefore, undersampling is used for balancing. Comparing the three models on the imbalanced and balanced datasets shows that the accuracy of the SDA-LM model on the imbalanced dataset is greater by 8.2 and 11.3% than that of the SAE-LM and SDA models, respectively. For the balanced dataset, these values are 10.4 and 7.3%, respectively. Therefore, using the stack denoising autoencoder and the LM algorithm to optimize the results, together with balancing the dataset, has a positive effect on delay forecasting and increases accuracy. The precision of the SDA-LM model on the imbalanced dataset is greater by 6.1 and 5.4% than that of the SAE-LM and SDA models, whereas the accuracy of the SDA-LM model on the balanced dataset is greater by 10% than that of the SAE-LM and SDA models, and the precision is the same for all three models on the balanced dataset.
Next, the model was evaluated in terms of dispersion by computing the standard deviation of all evaluation parameters over 30 runs of the model. The results show that the standard deviation of all evaluation parameters is lower for the balanced dataset than for the imbalanced one; therefore, balancing the data leads to a lower standard deviation. The standard deviation of the model is 0.045 for the imbalanced dataset, while a value of 0.21 is reported for the balanced dataset, which is small and means that the results are only slightly scattered and close to the average. Finally, we compared the accuracy of the proposed model against the SAE-LM, SDA and RNN [24,25] models. Our experimental results show that the accuracy of the model is 92.1% on the imbalanced dataset and 96.2% on the balanced dataset, which is greater by 4.1 and 8.2%, respectively. Therefore, the proposed model has greater accuracy in forecasting flight delay than the previous RNN model [24,25]. The next step would be to apply this technique to other datasets or to other sampled data and investigate its accuracy.
Activation function: An activation function determines the output behavior of each node, or "neuron", in an artificial neural network.
Overfitting: A model overfits the training data when it describes features that arise from noise or variance in the data, rather than the underlying distribution from which the data were drawn. Overfitting usually leads to loss of accuracy on out-of-sample data.
Dropout: Dropout changes training from learning all the weights together to learning a fraction of the weights in the network in each training iteration.
Epoch: in neural networks generally, an epoch is a single pass through the full training set.
Supervised learning: Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
Unsupervised learning: Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.
Fine-tune: Fine tuning is a process to take a network model that has already been trained for a given task, and make it perform a second similar task.
Precision: Precision is the ratio of correctly predicted positive observations (True Positives) to the system's total predicted positive observations, both correct (True Positives) and incorrect (False Positives).
Recall: Recall is the ratio of correctly predicted positive observations (True Positives) to all observations in the actual positive class (Actual Positives).
Accuracy: Accuracy is the most intuitive performance measure and is simply a ratio of the correctly predicted classifications (both True Positives + True Negatives) to the total Test Dataset.
F1 measure: The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both False Positives and False Negatives into account to strike a balance between Precision and Recall.
Specificity: Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).