Prediction of flight departure delays caused by weather conditions adopting data-driven approaches

In this study, we utilize data-driven approaches to predict flight departure delays. The growing demand for air travel is outpacing the capacity and infrastructure available to support it. In addition, abnormal weather patterns caused by climate change contribute to the frequent occurrence of flight delays. Because international flights cover vast distances across continents and oceans, forecasting flight delays over extended time periods is increasingly important. Existing research has predominantly concentrated on short-term predictions, which prompted our study to address longer prediction horizons. We collected datasets spanning over 10 years from three airports: ICN airport in South Korea, and JFK and MDW airports in the United States. The datasets capture flight information at six different time intervals (2, 4, 8, 16, 24, and 48 h) prior to flight departure and comprise 1,569,879 instances for ICN, 773,347 for JFK, and 404,507 for MDW. We employed a range of machine learning and deep learning approaches, including Decision Tree, Random Forest, Support Vector Machine, K-nearest neighbors, Logistic Regression, Extreme Gradient Boosting, and Long Short-Term Memory, to predict flight delays. Our models achieved accuracy rates of 0.749 for ICN airport, 0.852 for JFK airport, and 0.785 for MDW airport in 2-h predictions. For 48-h predictions, our models achieved accuracy rates of 0.748 for ICN airport, 0.846 for JFK airport, and 0.772 for MDW airport. Consequently, we have validated the accuracy of flight delay predictions for longer time frames. The implications and future research directions derived from these findings are also discussed.


Introduction
With the increasing demand for air travel, the number of air passengers has significantly increased. The global air passenger transport market doubles every 15 years [1]. For example, as of February 2023, the revenue passenger kilometer in Asia Pacific and North America has increased by 105.4% and 25.1% relative to that in 2022, respectively. Despite a temporary decline in passenger traffic during the Covid-19 pandemic, the number of air passengers has steadily increased over the past few decades [2].
Meeting the increasing demand for air travel and ensuring efficient supply chain operations require the development of aviation infrastructure. This includes expanding airport facilities, updating airline fleets, and implementing effective air schedule management. Addressing these issues is crucial to provide a seamless and reliable travel experience for passengers. However, a significant challenge in delivering satisfactory services is the frequent occurrence of unexpected flight delays and cancellations [3].
According to Tileagă and Oprisan [4], the number of compensation cases due to delayed flight schedules is increasing steadily. Table 1 shows that the number of compensation recipients for air delays and cancellations is steeply increasing every year. Flight delays have significant economic consequences for both airlines and passengers, rendering it a notable issue within the aviation industry.
Table 2 shows the types and proportions of delays from 2010 to 2021 at the John F. Kennedy International Airport (JFK). It reveals that weather-related delays account for a small proportion of delays (3.86%). However, weather-related delays were longer than other types of delays, with an average delay time of 69.81 min and a standard deviation of 100.79 min [5].
The frequency of abnormal weather phenomena that are known to contribute to an increase in flight delays [6] is on the rise worldwide. In addition, the regional climate determined by geographical location plays a significant role in flight operations [7]. For example, in South Korea, the total rainfall period is concentrated from July to September each year, with approximately 42.5% rainfall in July, 27.4% in August, and 12.8% in September. In addition, the region is directly affected by typhoons at the end of August through early September every year.
While previous studies on flight delay prediction have often incorporated weather information [8][9][10], the majority of these studies have centered around predicting delays within relatively short timeframes, typically within thresholds of 15 min or up to 4 h, primarily tailored to airline services. However, the unique context of international flights covering vast distances across continents and oceans, with flight durations spanning from as little as 10 h to as long as 20 h, underscores the necessity for delay prediction over more extended timeframes. Therefore, this study aims to predict flight delays over more extended timeframes (2 to 48 h) based on weather data. We focus on three well-known international airports: Incheon International Airport in South Korea (ICN), John F. Kennedy International Airport (JFK), and Chicago Midway International Airport (MDW) in the United States. In addition, we use weather information from the meteorological agencies located at each airport. The "Background and related work" section reviews previous research in this area, whereas the "Methodology" section presents the machine learning and deep learning models along with the evaluation methods utilized in the study. The experimental procedures and the comparison of the results across the models are presented in the "Implementation and result" section. The "Discussion and concluding remarks" section concludes this paper by presenting the interpretation of the results, noteworthy findings, limitations, and suggestions for future research.

Background and related work
Several studies have been conducted to forecast flight departure delays using various statistical methods, machine learning, and deep learning techniques. Table 3 provides a summary of prior flight delay detection research based on machine learning and neural network approaches.
Researchers [9,11,12] have utilized Bayesian modeling, clustering, classification, and regression with diverse datasets from different regions. The time span of the data varied, ranging from 1 month to 5 years, and the airports under investigation differed as well. Khaksar and Sheikholeslami [9] identified parameters that enable effective estimation of delays. They used Bayesian modeling, decision tree, cluster classification, random forest, and hybrid methods. They used 2,825,647 records for US airlines and 15,428 records for Iranian airlines, and achieved an accuracy of approximately 70%.
Al-Tabbakh et al. [11] analyzed the flight delay patterns using four decision tree classifiers: DecisionStump, J48, Random Forest, and REPTree. They utilized 512 records from a brief duration of 1 month, i.e., January 2018. The findings revealed that among the classifiers evaluated for the Egypt Airline dataset, REPTree attained the highest accuracy score of 80.3%.
Ye et al. [12] and Atlioğlu et al. [13] conducted flight delay prediction via supervised learning methods. Ye et al. [12] employed multiple linear regression, a support vector machine, extremely randomized trees, and LightGBM. They used 105,993 records and reported the highest accuracy of 86.53%.
Atlioğlu et al. [13] studied 11 machine learning models using data obtained following feature selection and transformation. They used 8086 records and achieved F1-scores of approximately 81%.
Certain researchers predict airline delay using neural networks and hybrid models [8,10,14]. Kim et al. [8] investigated the effectiveness of deep learning models in predicting air traffic delays. Daily sequences of departure and arrival flight delays for individual airports were modeled using the long short-term memory (LSTM) and recurrent neural network (RNN) architecture. The accuracy of RNN improves with deeper architectures, exhibiting the highest performance with an accuracy of 90.95% on the Atlanta air traffic data.
Qu et al. [10] analyzed and predicted flight delays using a convolutional neural network (CNN) and RNN models that are well-suited for classification problems in the field of deep learning. They improved the CondenseNet network by incorporating CBAM modules within the CNN-based CondenseNet algorithm to develop CBAM-CondenseNet. Additionally, they constructed a CNN-MLSTM network based on the CNN model and injected the SimAM module to enhance the attention of flight chain data. They used 36,287 records from China and achieved the highest accuracy score of 91.36%.
Yazdi et al. [14] designed a model that incorporates a technique based on a stacked denoising autoencoder to account for noisy flight delay data. They constructed SAE-LM based on an autoencoder and the LM algorithm. The stacked denoising autoencoder is built solely from denoising autoencoders. They utilized a comprehensive dataset spanning 5 years of US flight operations, comprising a total of 3,601,679 data points. The results demonstrated that the proposed model exhibited enhanced accuracy compared with the RNN model, highlighting its effectiveness in predicting flight delays. While numerous researchers have utilized state-of-the-art machine learning and deep learning techniques to study weather-related takeoff delays from various angles, the majority of studies have focused on predicting delays within a time criterion of approximately 15 min. There has been limited exploration and prediction of flight delays exceeding 2 h.
By employing established research methodologies, it is feasible to aggregate the outcomes of short-term predictions to generate long-term forecasts. Nevertheless, it is vital to recognize that repeated predictions may introduce inaccuracies. When assessing the practical utility of such models, the ability to predict aviation delays over extended time intervals based on input data widens the scope of possibilities for long-haul flights and diverse flight schedules. This expanded capability offers benefits not only from the perspective of airport resource management but also in various other aspects. Therefore, there is a pressing need for research that focuses on machine learning and neural network models capable of forecasting the distant future using data with genuine long-term time differences. Hence, in this study, our objective is to specifically address and forecast flight delays of more than 2 h.

Classification models
We used the following machine learning models and LSTM neural network to predict flight takeoff delays. The LSTM model boasts the advantage of effectively managing time-series data, but it comes with the drawback of requiring considerably more complex and powerful hardware. From this standpoint, machine learning (ML) models allow predictions at the individual time-unit level and are notably more computationally efficient when compared to the LSTM model.
• Decision Tree (DT): DT is a type of supervised learning model that classifies or regresses data by applying a set of classification rules. The resulting model has a tree-like structure, hence the name 'Decision Tree.' Pruning techniques can be applied to enhance the model's generalization performance and prevent overfitting, ensuring that it performs effectively on unknown data. Grid search can be used to find the optimal parameter values for the DT model, optimizing its performance [15]. It does not necessitate data preprocessing, such as normalization or handling missing values and outliers. It also has the capability to simultaneously handle both numerical and categorical variables. However, it has the limitation of considering only one variable at a time, which can make it challenging to capture interactions between variables. Moreover, the shape of the resulting decision tree can exhibit significant variations with minor differences in the data [16,17].
• Random Forest (RF): RF is an ensemble algorithm that trains multiple DT models and combines their results to make predictions. The method entails the random selection of a subset of features from the complete feature set to build one decision tree, followed by the selection of another random feature subset to create additional decision trees. Multiple decision trees are generated using this process. The final prediction is made by choosing the most frequently occurring prediction from these multiple decision trees [18]. This approach is versatile as it can be applied to both classification and regression problems. It is particularly effective in handling large-scale data and mitigates the issue of overfitting by reducing model noise, ultimately improving model accuracy [19,20].
• Support Vector Machine (SVM): SVM is a powerful supervised learning model that can be used for various tasks such as classification, regression, and anomaly detection. It aims to find a decision boundary that maximizes the separation between two classes while satisfying certain conditions. SVM can handle both linear and non-linear classification problems by using different kernel functions [21]. It determines the side of the decision boundary to which a data point belongs, allowing it to effectively classify data. Although it may be slower and less interpretable due to the requirement for multiple combination tests, it offers the advantage of being applicable to both categorical and numerical prediction problems, with minimal vulnerability to outlier data. Additionally, it is less susceptible to overfitting and more user-friendly compared to neural networks [22,23].
• K-Nearest Neighbors (KNN): KNN is a classification algorithm that operates based on the principle of similarity. It assigns a class label to a given data point by considering the labels of its "k" nearest neighbors in the feature space. The distance between data points is typically calculated using the Euclidean distance measurement method [24]. It offers several advantages, such as high accuracy and the ability to exclude outlier data from consideration by using only the top k closest data points. Furthermore, it does not rely on assumptions about the data since it is based on existing data. However, it has the disadvantage of increased processing time as the dataset size grows, as it needs to compare with all existing data points, and it may require significant memory usage for large datasets [22,25].
• Logistic Regression (LR): LR is one of the simplest classification models. It predicts the probability of data belonging to a certain category as a value between 0 and 1 and classifies it into the category with a higher probability [26]. It has the advantage of being less complex and faster due to linear combinations, making it easy to interpret the results. However, it may suffer a reduction in learning ability when dealing with non-linear relationships and can be sensitive to outliers and anomalies, which are its disadvantages [27,28].
• Extreme Gradient Boosting (XGB): XGB is an algorithm implemented using the boosting technique. It supports both regression and classification problems and exhibits suitable performance and resource efficiency. It is characterized by strong robustness owing to its built-in regularization against overfitting [29,30].
• Long Short-Term Memory (LSTM): LSTM networks are a type of RNN that can learn the order dependence in sequence prediction problems. RNNs are modified by adding a memory cell that can store information for an extended period. LSTM was proposed as a solution to address the issue of vanishing gradients in RNN when processing long sequential data [31]. However, it has the drawback of being computationally intensive and having a complex model structure due to the incorporation of forget gates, input gates, and output gates [32][33][34].
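As an illustration of the KNN rule described in the list above (Euclidean distance followed by a majority vote among the k closest points), a minimal Python sketch is given below. It is not the implementation used in the experiments; the two features (wind speed and precipitation), their values, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (wind speed, precipitation) -> 1 = delayed, 0 = on time.
train_X = [(2.0, 0.0), (3.0, 0.1), (2.5, 0.0), (12.0, 5.0), (14.0, 6.5), (11.0, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

calm = knn_predict(train_X, train_y, (2.8, 0.05))
stormy = knn_predict(train_X, train_y, (13.0, 5.5))
print(calm, stormy)
```

Because every query is compared with all stored points, the cost of a single prediction grows with the size of the training set, which is the processing-time drawback noted for KNN.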

Evaluation methods
To evaluate the performance of each classifier, we calculated the confusion matrix and measured the accuracy, precision, recall, and F-score. Table 4 shows the confusion matrix, a 2 × 2 matrix representation of classification results. The number of correctly classified instances is the sum of the diagonals of the matrix, while all other instances are incorrectly classified. Each item in the confusion matrix includes the following four indicators. The first indicator is True Positive (TP), which signifies that the predicted value is positive when the actual value is positive. The second indicator is True Negative (TN), indicating that the predicted value is negative when the actual value is negative. The third indicator is False Positive (FP), denoting that the predicted value is positive when the actual value is negative. Lastly, the fourth indicator is False Negative (FN), showing that the predicted value is negative when the actual value is positive [35].
Accuracy serves as "a metric for assessing the overall performance of each model by computing the ratio of correctly classified samples to the total number of samples" [36]. However, in situations with a significant imbalance between positive and negative samples, accuracy may not provide a suitable evaluation measure. Precision presents "the proportion of true positive cases among all predicted positive cases" [37], while recall computes "the ratio of correctly predicted positive samples to the total number of true positive samples" [38]. F1-score represents "a balanced measure that combines both precision and recall" [39].
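The four metrics follow directly from the TP, TN, FP, and FN counts. The Python sketch below uses the standard definitions, with F1 taken as the harmonic mean of precision and recall. The label vectors are made-up examples rather than experimental outputs, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = delayed, 0 = on time.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```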

Data description and analysis
We collected three datasets including flight and weather information of Incheon International Airport in South Korea (ICN) [40], John F. Kennedy International Airport (JFK) [41], and Chicago Midway International Airport (MDW) [42] in the United States. The flight information [43,44] is organized by all flight-related features, including scheduled departure time, actual departure time, and delay type. The weather information consists of the officially reported regional weather features. The flight information scheduled from 2010 to 2021 was examined, spanning a total of 11 years. The weather information corresponding to the same period was also collected. For the experiment, weather and flight information were merged with a time difference for data preprocessing to predict flights based on weather conditions. The merged datasets include the attributes listed in Tables 5 and 6. Among these attributes, the airline, flight number, and destination were not used in the actual model training. Additionally, since the features wind direction (e.g., NW, WNW) and condition (e.g., Cloudy, Windy) are categorical data, they were transformed into one-hot encoding before being included in the training dataset.

ICN dataset
In situations where the actual departure time differs from the scheduled departure time by more than 1 h, we classify the data as delayed. The ICN dataset comprises 1,562,029 instances of normal flights and 7850 instances of delayed flights caused by weather conditions. To achieve a balanced distribution between normal and delayed cases, we randomly sampled an equal number of normal and delayed flight instances. To address the absence of certain features in the cases, we utilized a data interpolation method that was previously validated in a research study [45]. Due to the hourly-based nature of the ICN weather information, there were instances of missing features. To fill these gaps, we employed a linear interpolation technique to estimate the values for the unmeasured time periods. The interpolated data comprises 953 data points, which accounts for 0.9% of the total 105,192 data points. Furthermore, we included flight takeoff results with time differences as additional features.
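The labeling and balancing steps can be sketched in Python as follows. This is a simplified illustration under two assumptions: a flight counts as delayed when it departs more than 1 h after its scheduled time, and the delay cause (weather or otherwise) is ignored. The function names and the toy flights are hypothetical.

```python
import random
from datetime import datetime, timedelta

DELAY_THRESHOLD = timedelta(hours=1)


def is_delayed(scheduled, actual):
    """Assumed rule: delayed if departure is more than 1 h after schedule."""
    return actual - scheduled > DELAY_THRESHOLD


def downsample_balanced(rows, labels, seed=0):
    """Randomly sample equally many normal and delayed rows (1:1 ratio)."""
    rng = random.Random(seed)
    delayed = [row for row, flag in zip(rows, labels) if flag]
    normal = [row for row, flag in zip(rows, labels) if not flag]
    n = min(len(delayed), len(normal))
    sample = [(row, 1) for row in rng.sample(delayed, n)]
    sample += [(row, 0) for row in rng.sample(normal, n)]
    rng.shuffle(sample)
    return sample


# Toy flights: (scheduled, actual) with departure offsets in minutes.
base = datetime(2021, 7, 1, 9, 0)
flights = [(base, base + timedelta(minutes=m)) for m in (0, 5, 20, 45, 60, 61, 90, 240)]
labels = [is_delayed(s, a) for s, a in flights]
balanced = downsample_balanced(flights, labels)
print(len(balanced), sum(flag for _, flag in balanced))
```

With the ICN counts reported above, this kind of down-sampling would keep all 7850 delayed instances and draw 7850 of the 1,562,029 normal instances.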
To fulfill the objectives of the present study, we implemented a time difference criterion and utilized combined flight and weather cases.The time differences were categorized into intervals of 2, 4, 8, 16, 24, and 48 h.
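One way to build such time-difference cases is to pair each flight with the weather observation recorded a fixed number of hours before its scheduled departure. The Python sketch below assumes hourly weather records keyed by timestamp and truncates the scheduled time to the full hour; the field names (`scheduled`, `temp_c`, `lead_hours`) are hypothetical and the actual merge procedure may differ.

```python
from datetime import datetime, timedelta

LEAD_HOURS = (2, 4, 8, 16, 24, 48)


def attach_lagged_weather(flights, weather_by_hour, lead_hours):
    """Join each flight with the weather observed lead_hours before departure."""
    merged = []
    for flight in flights:
        # Assumption: hourly observations, so truncate to the full hour first.
        hour = flight["scheduled"].replace(minute=0, second=0, microsecond=0)
        observation = weather_by_hour.get(hour - timedelta(hours=lead_hours))
        if observation is None:
            continue  # no observation for that hour; skip the flight
        merged.append({**flight, **observation, "lead_hours": lead_hours})
    return merged


# Toy hourly weather: temperature equals the hour index, for easy checking.
start = datetime(2021, 7, 1)
weather_by_hour = {
    start + timedelta(hours=i): {"temp_c": float(i)} for i in range(72)
}
flights = [{"scheduled": datetime(2021, 7, 3, 10, 30), "delayed": 1}]

# One merged dataset per time difference.
datasets = {h: attach_lagged_weather(flights, weather_by_hour, h) for h in LEAD_HOURS}
print({h: rows[0]["temp_c"] for h, rows in datasets.items()})
```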

JFK dataset and MDW dataset
Similar to the ICN dataset, we created delayed flight instances for the JFK and MDW datasets based on the time difference between the scheduled and actual departure times.
The JFK dataset consisted of 763,930 normal cases and 9417 delayed cases attributed to weather conditions, while the MDW dataset comprised 398,945 normal cases and 5562 delayed flight instances.Similar to the approach followed for the ICN dataset, we conducted down-sampling procedures to achieve a 1:1 ratio of normal and delayed cases.
In both the JFK and MDW datasets, the weather information consists of several categorical features, such as wind direction and condition details. To incorporate these features into our data-driven approaches for machine learning and neural network frameworks, we employed a one-hot encoding technique. This encoding method allows us to represent the categorical variables as binary vectors, facilitating their utilization in the models. Additionally, we included flight takeoff results with time differences as one of the features in the dataset. Subsequently, the JFK and MDW flight datasets were each merged with the corresponding weather information.
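One-hot encoding maps each category to a binary vector with a single 1. A minimal Python sketch is shown below; the wind-direction values are examples only, the helper names are illustrative, and encoding an unseen category as an all-zero vector is a choice made for this sketch.

```python
def fit_categories(values):
    """Collect the distinct categories in a fixed (sorted) order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Binary vector with a 1 at the position of value; all zeros if unseen."""
    return [1 if value == category else 0 for category in categories]


wind_directions = ["NW", "WNW", "NW", "S"]
categories = fit_categories(wind_directions)
encoded = [one_hot(value, categories) for value in wind_directions]
print(categories, encoded)
```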

Experiment
Figure 1 shows the flow chart of our overall approach.For machine learning models, we input the data sampled following the process as mentioned above, while we stack the sampled data to create time-series data and input them to the LSTM model.
To begin, we partitioned the dataset into development and testing subsets in an 80:20 ratio. Subsequently, we further divided the development subset into training and validation subsets in an 80:20 ratio, resulting in a distribution of the training, validation, and test datasets with a ratio of 64:16:20. Table 7 presents the number of instances used for training, validation, and testing. All experiments were conducted on a single GeForce RTX 3080 Ti 10GB GPU and implemented using Python 3.6 as the programming language. We performed a grid search to determine the optimal hyperparameters, including learning rates, number of epochs, number of layers, and number of stacked time-series data, and selected the parameters that yielded the best performance. Tables 8 and 9 show the list of hyperparameters for DT and LSTM used in the grid search. In the case of the LSTM model, the number of trainable parameters varied for each airport dataset. The ICN dataset had 2,385 parameters, while the JFK and MDW datasets had 2,833 parameters.
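Applying an 80:20 split twice gives nominal proportions of 0.8 × 0.8 = 64% for training, 0.8 × 0.2 = 16% for validation, and 20% for testing. The Python sketch below shows the two-stage split on shuffled indices. The size of 15,700 is a hypothetical example equal to twice the 7850 delayed ICN instances, assuming the 1:1 down-sampling; the function name and the seed are likewise illustrative.

```python
import random


def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Two-stage split: hold out the test set, then split the rest again."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_frac)
    test, rest = indices[:n_test], indices[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(15700)
print(len(train), len(val), len(test))
```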

Flight delay prediction
Tables 10, 11 and 12 show the prediction results of flight departure delays based on weather data using various models. The results were obtained for a total of six different time differences (2, 4, 8, 16, 24, and 48 h).
Table 10 summarizes the results of the ICN dataset. The RF model reported the highest accuracy score of 0.749 with a time difference of 2 h. Although the DT model showed the best recall (0.700), the RF model displayed superior performance in the other metrics.
For the JFK airport dataset with a time difference of 2 h, the LSTM model achieved the highest accuracy score of 0.852 (Table 11). In terms of recall for predicting flight delays, the DT model outperformed all other models (0.826), whereas in terms of precision of prediction of on-time flights, the RF model outperformed all other models (0.835). Nonetheless, the LSTM model demonstrated superior performance in other evaluation metrics.
The result corresponding to the MDW airport dataset for a time difference of 2 h is presented in Table 12. The LSTM model achieved the highest accuracy score of 0.785. Although the DT model exhibited the best performance in terms of recall (0.759), the LSTM model outperformed the other models in all other evaluation metrics.

Flight delay prediction (1 to 24 h, hourly)
Tables 13, 14 and 15 provide an hourly breakdown of model accuracy from 1 h to 24 h, utilizing the same three datasets for ICN, JFK, and MDW airports, along with average training and testing times. The hyperparameters that yielded the best performance in the prior experiments were applied. Across all three airport datasets, the highest accuracy was observed at a 1-h time difference, with a declining trend in performance as the time difference increased. The magnitude of performance decline from 1 h to 24 h for each model is detailed in Table 16. Notably, the Random Forest model exhibited the least performance degradation, with a decrease of only 3.6%, while the SVM model showed the most significant performance decline, with an average decrease of 16.1%. Machine learning models completed their training in just a few seconds, while LSTM required several hundred seconds, indicating it was approximately 100 times more time-consuming. The testing time ranged from as low as 1 ms to a maximum of around 1.3 ms.

Ablation study
We conducted training on the ICN dataset with identical parameters and training strategies, except for the exclusion of linear interpolation, while examining a time difference of 2 h. The results, as depicted in Table 17, reveal a slight reduction in overall performance, ranging from 1 to 2%, when interpolation was omitted. It is noteworthy that the interpolated data constitutes only 0.9% (953 out of 105,192) of the entire dataset, which lends credibility to the decision to incorporate linear interpolation in our research.
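For reference, linear interpolation fills a missing hourly value by drawing a straight line between the nearest known observations on either side. The Python sketch below illustrates the idea on a made-up temperature series. Gaps at the very start or end of a series are left unfilled here, which is a simplification of this sketch and not necessarily how the datasets were processed.

```python
def interpolate_linear(values):
    """Fill None gaps that lie between two known values by linear interpolation."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Made-up hourly temperatures with three missing readings.
temps = [20.0, None, None, 23.0, 24.0, None, 22.0]
filled = interpolate_linear(temps)
print(filled)
```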

Feature importance
To determine the features with a substantial impact on our models, we conducted feature importance analysis. We chose the Random Forest and LSTM models, which demonstrated the best performance. For the Random Forest model, we made use of the built-in feature importance function, whereas for the LSTM model, we employed external algorithms using loss data. Consequently, in the case of Random Forest, higher values correspond to greater feature importance, whereas for LSTM, lower values signify reduced importance. Considering the results of the ICN airport dataset, Random Forest attributed the highest importance to temperature, dew point, and weather phenomena in that order, while LSTM assigned the highest importance to temperature, wind speed, weather phenomena, and local pressure. Notably, temperature was identified as the most crucial feature in both models (Table 18). For the JFK airport dataset, Random Forest identified pressure, temperature, and dew point as the most important features, while LSTM emphasized pressure, precipitation, and wind speed as the top influential factors. Notably, pressure was recognized as the most crucial feature in both models for this dataset (Table 19).
In the case of the MDW airport dataset, Random Forest indicated that pressure, humidity, and temperature were the top features in terms of importance, while LSTM emphasized pressure, precipitation, and wind speed as the most influential factors. Notably, pressure was consistently identified as the most important feature in both models for this dataset (Table 20).
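The loss-based algorithm used for the LSTM model is not specified here. One common loss-based approach is permutation importance, in which the values of a single feature are shuffled and the resulting increase in loss is taken as that feature's importance. The Python sketch below illustrates this general idea only and should not be read as the procedure behind Tables 18 to 20. The stand-in model, the feature values, and the function names are all hypothetical.

```python
import random


def error_rate(model, X, y):
    """Fraction of rows on which the model's prediction differs from the label."""
    return sum(model(row) != label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in error rate after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = error_rate(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_permuted = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, column)]
        importances.append(error_rate(model, X_permuted, y) - baseline)
    return importances


def toy_model(row):
    """Hypothetical stand-in: predicts a delay when pressure is below 1000."""
    return 1 if row[0] < 1000 else 0


# Feature 0 = pressure (used by the model), feature 1 = an unused feature.
X = [(990 + 2 * i, i % 3) for i in range(12)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y, n_features=2)
print(importances)
```

In this toy setting the unused feature receives an importance of exactly zero, because shuffling it cannot change any prediction.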

Comparison with prior approaches
We conducted a performance comparison between our models and a prior research model [8]. Using the same JFK airport dataset, we compared our Random Forest and LSTM models with the LSTM model from the prior research. Our Random Forest model achieved an accuracy of 84.3% with a 2-h time difference and 84.6% with a 48-h time difference. Our LSTM model achieved an accuracy of 85.2% with a 2-h time difference and 73.6% with a 48-h time difference. It is worth noting that the previous model exhibited a performance of 86.51% at a short time interval of 15 min.

Discussion and concluding remarks
For predicting flight takeoff delays using weather information for the ICN, JFK, and MDW airports, machine learning and LSTM models were employed. Based on the prediction results for the three airports, the RF model demonstrated the highest performance for ICN airport, while the LSTM model exhibited the highest performance for JFK and MDW airports, at the minimum time difference of 2 h. The accuracy scores were 0.749 for ICN, 0.852 for JFK, and 0.785 for MDW airports. The RF model displayed the best performance for all three airports at the maximum time difference of 48 h; the accuracy scores were 0.748 for ICN, 0.846 for JFK, and 0.772 for MDW airports. In addition, all of the models require less than 2 ms of test time, which makes them suitable for real-time predictions. These findings confirm the feasibility of predicting flight takeoff delays using weather data collected 2 to 48 h prior to the scheduled departure time.
Our analysis incorporated datasets spanning from 2010 to 2021, encompassing a long time period. This extensive dataset allowed us to leverage both actual flight operation data and weather information for our analysis. By utilizing these comprehensive datasets, our proposed models exhibited strong performance in predicting delayed flights across three different datasets. The utilization of a long-term dataset facilitated robust predictions and enhanced the reliability of our models. Furthermore, the approaches we developed can be applied to various other domains affected by weather, including ocean vessel delays, vehicle operation restrictions, and outdoor construction work stoppages. In these application areas, early-stage warnings play a crucial role in mitigating potential risks to human safety and property damage. By leveraging our proposed models, it becomes feasible to anticipate and prepare for potential disruptions, enabling proactive measures to be taken in advance. This can significantly contribute to minimizing the adverse impacts associated with delays and restrictions in these sectors. The presented implications notwithstanding, it is important to acknowledge the presence of notable limitations. One such limitation is the significant influence of national and regional factors on weather conditions, which makes it challenging to generalize the results to other locations. Furthermore, the performance on the ICN airport dataset was relatively lower compared with the JFK and MDW airport datasets. This discrepancy in performance could be attributed to several factors, including the presence of missing features in the dataset. In future research, our aim is to develop a more robust model that incorporates geographic information, enabling its application to other airports beyond the specific datasets analyzed in this study.

Fig. 1
Flow charts for (a) machine learning and (b) LSTM models

Table 1
Number of eligible passengers for compensation versus the number of total passengers

Table 2
Different types of delays

Table 4
Confusion matrix

Table 5
Incheon International Airport's attributes list

Table 6
Attributes list of John F. Kennedy International Airport and Chicago Midway International Airport

Table 7
Summary of the employed datasets in training, validation, and test sessions

Table 8
Tested parameters in DT

Table 9
Tested parameters in LSTM model

Table 10
Results of ICN airport

Table 11
Results of JFK airport

Table 12
Results of MDW airport

Table 18
Feature importance of ICN airport. Bold values indicate the greatest results

Table 19
Feature importance of JFK airport

Table 20
Feature importance of MDW airport. Bold values indicate the greatest results