Development of a predictive model for on-time arrival flight of airliner by discovering correlation between flight and weather data
Journal of Big Data volume 6, Article number: 85 (2019)
Customer satisfaction is central to an airline's business. Flights are delayed by bad weather, mechanical problems, and the late arrival of the aircraft at the point of departure, and these delays lead to customer dissatisfaction. This paper proposes a predictive model of on-time arrival flights that uses flight data and weather data. The key research question is how to discover the correlation between flight data and weather data. The relation between pressure patterns and the flight data of Peach Aviation, a low-cost carrier (LCC) in Japan, is clarified, and it is found that the sea-level pressures at three weather observation spots (Wakkanai, the northernmost spot; Minami-Torishima, the easternmost spot; and Yonagunijima, the westernmost spot) can classify the pressure patterns. As a result, on-time arrival flights are predicted with 77% accuracy using the Random Forest Classifier. Furthermore, the feasibility of the predictive model is evaluated by developing an on-time arrival flight prediction tool.
The Ministry of Land, Infrastructure, Transport and Tourism reported the performance of Japanese airlines in 2017. The on-time performance rate refers to flights that are operated and depart within 15 min of the scheduled departure time. The departure time means the block-out time, in other words, the time at which the aircraft starts to move. Flight cancellations are not reflected in the on-time performance rate. Accordingly, the delay rate refers to flights departing more than 15 min after the scheduled departure time. The cancellation rate is the percentage of canceled flights relative to the number of scheduled flights, where a scheduled flight is one planned to operate on the day; this differs from a suspension, in which the flight is canceled in advance. Peach Aviation is a low-cost carrier (LCC) operating commercial passenger jets on short- and medium-distance routes, and it uses the Airbus A320-200. Peach operates on time at a rate of 79.17%. Delayed flights, which arrive more than 15 min after the estimated arrival time, account for 20.83%. The causes of delay are weather (1.29%), mechanical reasons (0.62%), the late arrival of the aircraft at the point of departure (14.99%), and other causes (3.94%). Canceled flights account for 1.32%. The causes of cancellation are weather (1.16%), mechanical reasons (0.04%), the late arrival of the aircraft at the point of departure (0.06%), and other causes (0.07%).
There are two kinds of reasons why a flight may be delayed or canceled. One is company circumstances; the other is unavoidable events. Under company circumstances, the airline is responsible for aircraft parts trouble and system failures. Unavoidable events include natural disasters such as heavy snow and typhoons. This paper therefore focuses on on-time flights and develops an on-time flight prediction, because on-time flight performance is not affected by company circumstances or unavoidable events. The key research question is how to discover the correlation between flight data and weather data. Flight data from FLIGHTSTATS and weather data from the Japan Meteorological Agency are gathered for the period from March 1, 2012 to December 3, 2018. FLIGHTSTATS also provides on-time arrival performance, that is, flights whose arrival delay is less than 15 min. Using these data resources, the features of flight performance and the pressure patterns of the weather are analyzed. The sea-level pressures at three weather observation spots (Wakkanai, the northernmost spot; Minami-Torishima, the easternmost spot; and Yonagunijima, the westernmost spot) are selected as pinpoint weather data in order to extract the weather features related to pressure patterns, because selecting pinpoint data in this way is useful in big data analytics for discovering correlations among data. The relation between these pinpoint data and the pressure pattern can be demonstrated. As a result, a predictive model of on-time arrival flights is developed using the flight data and the weather data. The model is evaluated with flight data and weather data from January 1, 2019 to January 31, 2019. Furthermore, a tool is developed to evaluate the feasibility of the model.
The rest of the paper is organized as follows. The “Literature review” section describes related work on flight prediction. The “Methodology” section outlines the discovery of the correlation between flight data and weather data, and the development of the on-time flight prediction. The performance and feasibility of the model are demonstrated in the “Experimental result and discussion” section. Finally, the “Conclusion” section describes directions for future work.
Much research has been done on flight delay. Flight delay prediction, its analysis, and its causes are important topics for air traffic control, airline decision-making, and ground delay response programs. In the United States, where transit through hub airports is a distinctive feature, delay models of how delays propagate through a sequence of flights have been studied. To use such delay models for prediction, predictive models have been developed. Moreover, predictive models of arrival delay and departure delay that use weather data have been actively studied.
Kim et al. focused on flight sequences and proposed Recurrent Neural Networks (RNN) that predict departure and arrival delay using weather data from the National Oceanic and Atmospheric Administration (NOAA). The accuracy of delayed-flight prediction ranged from 91.81% at McCarran International Airport (LAS) to 71.34% at Sky Harbor International Airport (PHX), a range attributed to the difference in data volume. Choi et al. focused on the relation between flight delay and weather, with weather data gathered from NOAA. Their proposed Random Forest, an ensemble learning method, predicted arrival delay with 80.36% accuracy. Belcastro et al. predicted arrival delay due to bad weather using Random Forest in MapReduce. This study, which used NOAA weather data, reported the following results: (1) with a delay threshold of 15 min, an accuracy of 74.2% and a recall of 71.8% on delayed flights, and (2) with a threshold of 60 min, an accuracy of 85.8% and a delay recall of 86.9%. Thiagarajan et al. proposed a predictive model based on Gradient Boosting, a machine learning technique for regression and classification, with weather data from the World Weather Online API consisting of weather attributes of the origin and destination airports. They predicted arrival delay with 94.35% accuracy and departure delay with 86.48% accuracy. Outside the United States, Prasad et al. predicted departure and arrival delays for domestic flights in Brazil using a Decision Tree, which uses a tree-like model of decisions and their possible consequences, with weather data from the same source as Thiagarajan et al. For arrival delay with the Decision Tree classifier, the accuracy is 78%. For departure delay with the Regression classifier, the accuracy is 77%.
Most studies focus on departure or arrival delay using machine learning methods. In this paper, the data model for predicting on-time arrival flights is designed by discovering the correlation between flight data and weather data through big data analytics. The contribution is to identify sea-level pressure pinpoint data that relate to the flight features. A predictive model whose data rationale can be explained is developed and implemented using machine learning.
According to the 2018 route map, Peach flights use the following domestic airports: Kushiro Airport (KUH), Sapporo (Shin-Chitose) Airport (CTS), Sendai Airport (SDJ), Niigata Airport (KIJ), Haneda Airport (Tokyo International Airport) (HND), Narita International Airport (NRT), Kansai International Airport (KIX), Matsuyama Airport (MYJ), Fukuoka Airport (FUK), Nagasaki Airport (NGS), Miyazaki Airport (KMI), Kagoshima Airport (KOJ), Naha Airport (OKA), and Ishigaki Airport (ISG). They also use the following international airports: Incheon International Airport (ICN), Gimhae International Airport (PUS), Hong Kong International Airport (HKG), Taoyuan International Airport (TPE), Kaohsiung International Airport (KHH), Shanghai Pudong International Airport (PVG), and Suvarnabhumi Airport (BKK). Peach flight data for these routes can be extracted from the FLIGHTSTATS data resource. Nine features are extracted from the FLIGHTSTATS flight data: departure date (year, month, day, hour, minute), arrival date (hour, minute), departure airport, and destination airport. The names of the departure airport and destination airport are then each converted to latitude (degree, minute, second) and longitude (degree, minute, second), so that nineteen flight data features are obtained. FLIGHTSTATS defines an on-time arrival flight as a flight that arrives less than 15 min after the estimated arrival time.
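The construction of the nineteen flight features described above could be sketched as follows. This is a minimal illustration rather than the author's implementation: the `AIRPORTS` table, its coordinate values, and the function names are hypothetical, and the conversion assumes non-negative coordinates.

```python
from datetime import datetime

# Hypothetical lookup table in decimal degrees (illustrative values only,
# not taken from the paper).
AIRPORTS = {
    "KIX": (34.4342, 135.2328),
    "CTS": (42.7752, 141.6923),
}

def to_dms(decimal_degrees):
    """Split a non-negative decimal-degree value into (degree, minute, second)."""
    total_seconds = round(decimal_degrees * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return degrees, minutes, seconds

def flight_features(departure, arrival, origin, destination):
    """Build a 19-element vector: 7 date/time values plus 12 coordinate values."""
    features = [departure.year, departure.month, departure.day,
                departure.hour, departure.minute,
                arrival.hour, arrival.minute]
    for code in (origin, destination):
        latitude, longitude = AIRPORTS[code]
        features.extend(to_dms(latitude))
        features.extend(to_dms(longitude))
    return features

example = flight_features(datetime(2018, 12, 3, 7, 5),
                          datetime(2018, 12, 3, 9, 0), "KIX", "CTS")
```

Seven date and time values plus three latitude and three longitude components for each of the two airports give the nineteen features.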
These features are analyzed and classified according to the pressure patterns extracted from the weather conditions on the days with the most on-time arrivals, delays, and cancellations. Tables 1, 2 and 3 show the result of the classification. It is found that the days with the most on-time arrivals fall under high pressure, that the days with the most delays fall under the winter pressure pattern, in which high pressure lies to the west and low pressure to the east, and that the days with the most cancellations fall under the typhoon pressure pattern. Figure 1 shows the change in the number of flights from March 1, 2012 to December 3, 2018, which is the period of the learning data. Because the number of flights has increased over time, the days with the most on-time arrivals, delays, and cancellations are likely to fall in fiscal 2018.
Sea-level pressure is used to extract the characteristics of the pressure patterns under which many flight delays and cancellations occur, because a previous study classified pressure patterns from sea-level pressure. The weather observation spots for sea-level pressure, which indicate the characteristics of the weather map, are selected as pinpoints. Figure 2 shows the northern, southern, eastern, and western endpoints among the weather observation spots. Oki-no-Tori Shima is not selected because no data are publicly available for it. Hourly sea-level pressure data are extracted from three weather spots: Wakkanai, at latitude 45°24.9′ north and longitude 141°40.7′ east; Minami-Torishima, at latitude 24°17.3′ north and longitude 153°59.0′ east; and Yonagunijima, at latitude 24°28.0′ north and longitude 123°0.6′ east. In addition, hourly data on temperature, humidity, and wind direction are collected. For wind direction, the 16 directions from north to north-northwest are converted to integer values from 1 to 16, and calm is converted to the integer value 0 (Fig. 3).
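The wind direction encoding can be illustrated with a short sketch. The compass labels, the `CALM` marker, and the clockwise ordering are assumptions made for illustration; the paper only states that north through north-northwest map to 1 through 16 and that calm maps to 0.

```python
# 16 compass points, assumed to run clockwise from north to north-northwest.
COMPASS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
CALM = "CALM"  # hypothetical label for a calm observation

def encode_wind_direction(label):
    """Map a compass label to 1..16, and a calm observation to 0."""
    if label == CALM:
        return 0
    return COMPASS_16.index(label) + 1

encoded = [encode_wind_direction(d) for d in ["N", "E", "NNW", CALM]]
```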
These three weather observation spots show a typical relation between pressure pattern and sea-level pressure. The weather maps and the daily average sea-level pressures for the days with the most on-time arrivals, delays, and cancellations are extracted from the data resource of the Japan Meteorological Agency. Table 4 shows the weather patterns and sea-level pressures, and Figs. 4, 5, 6, 7, and 8 show the weather maps for each weather pattern. In the winter pressure type, the sea-level pressure at Yonagunijima is higher than at the other two spots (Fig. 6). In the typhoon type, the sea-level pressure at Yonagunijima is lower than at the other two spots (Fig. 8).
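As a rough illustration of the two observations above, the comparison of the three sea-level pressures could be written as a simple rule. This is only a sketch under stated assumptions: the function name, the label strings, and the example pressure values are hypothetical, and the paper itself feeds the pressures to a learned classifier rather than a fixed rule.

```python
def pressure_pattern_hint(wakkanai_hpa, minami_torishima_hpa, yonagunijima_hpa):
    """Return a coarse hint from the relative sea-level pressure at Yonagunijima."""
    others = (wakkanai_hpa, minami_torishima_hpa)
    if yonagunijima_hpa > max(others):
        return "winter-like"   # Yonagunijima higher than both other spots
    if yonagunijima_hpa < min(others):
        return "typhoon-like"  # Yonagunijima lower than both other spots
    return "other"

# Illustrative pressures in hPa (made up, not taken from Table 4).
hints = [
    pressure_pattern_hint(1008.0, 1015.0, 1022.0),
    pressure_pattern_hint(1012.0, 1014.0, 995.0),
    pressure_pattern_hint(1010.0, 1018.0, 1013.0),
]
```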
In big data mining, the first step is to collect big data with the characteristics called the 3 Vs: Volume, the amount of data; Variety, the diversity of the data to be handled; and Velocity, the speed at which data are generated. Adding Veracity, the accuracy of the data, and Value, the usefulness of the data for decision making, gives the 5 Vs. Second, the collected data are analyzed to find patterns, for example through pattern recognition and predictive analysis. Third, any correlation among the collected or classified data should be discovered. From such correlations, the meaning and value of the data can be found by uncovering the deeper knowledge behind them.
Figure 3 shows the result of the correlation analysis, performed in R, between the data sets of the learning data. These data sets are departure date (year, month, day, hour, minute), arrival date (hour, minute), latitude of the departure airport (degree, minute, second), longitude of the departure airport (degree, minute, second), latitude of the destination airport (degree, minute, second), longitude of the destination airport (degree, minute, second), temperature (Wakkanai, Yonagunijima, Minami-Torishima), wind direction (Wakkanai, Yonagunijima, Minami-Torishima), sea-level pressure (Wakkanai, Yonagunijima, Minami-Torishima), and flight result (on-time, delayed, canceled). Some correlations between flight data and weather data can be seen. Moreover, when the correlation coefficients are rounded at the second decimal place, seven correlations between flight data and flight results are found, as follows: (1) a negative relation between the hour of the departure date and on-time flight, (2) a positive relation between the hour of the departure date and delayed flight, (3) a negative relation between the hour of the arrival date and on-time flight, (4) a positive relation between the hour of the arrival date and delayed flight, (5) a negative relation between the second in the latitude of the departure airport and on-time flight, (6) a positive relation between the second in the latitude of the departure airport and delayed flight, and (7) a positive relation between the second in the longitude of the departure airport and on-time flight.
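The sign of such a relation can be illustrated with a Pearson correlation between a numeric feature and an indicator of the flight result. The sketch below uses Python rather than the R analysis of the paper, and the six toy records are invented purely to show the calculation; they are not the study's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy records (departure hour, flight result); invented for illustration.
records = [(7, "on-time"), (9, "on-time"), (12, "on-time"),
           (15, "delayed"), (18, "delayed"), (21, "delayed")]
hours = [hour for hour, _ in records]
on_time = [1 if result == "on-time" else 0 for _, result in records]
delayed = [1 if result == "delayed" else 0 for _, result in records]

r_on_time = pearson(hours, on_time)   # negative in this toy sample
r_delayed = pearson(hours, delayed)   # positive in this toy sample
```

In this toy sample, later departure hours coincide with delays, so the two coefficients have opposite signs, mirroring relations (1) and (2).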
As for (1), (2), (3), and (4), it can be concluded that the number of takeoffs and landings at an airport affects these relations. Figure 9 shows the average daily number of takeoffs and landings at Narita International Airport in 2017. These numbers increase as the day progresses. As for (5), (6), and (7), it can be concluded that the geographical location of the departure airport affects these relations because of the pressure pattern. In (5), arrival flights from eastern departure airports tend to be on time because the eastern part tends to be under Pacific high pressure, whereas the western part tends to be under the typhoon type. In (6) and (7), arrival flights from northern departure airports tend to be delayed because the northern part of Japan tends to be under the winter pressure pattern. Thus, it can be concluded that the geographical location of the departure airport affects on-time arrival prediction.
Flight data and weather data are used to propose and create two models of on-time arrival, one with weather data and one without, because the Japan Meteorological Agency does not provide future weather data. In addition, four models are prepared in order to evaluate them: delayed arrival with and without weather data, and cancellation with and without weather data.
The predictive models are implemented with the SVM Classifier, Gradient Boosting Classifier, Random Forest Classifier, and AdaBoost in the scikit-learn machine learning library, and their performance is evaluated. In generating each predictive model, the class weight parameter was set to balanced in order to cope with imbalanced data, and the hyperparameters were tuned by grid search.
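A minimal sketch of this setup for the Random Forest Classifier is shown below. The synthetic data, the number of features, the parameter grid, the F1 scoring, and the three-fold cross-validation are all assumptions made for illustration; the paper does not list the values it searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the flight and weather features
# (not the study's data; the feature count is arbitrary).
X, y = make_classification(n_samples=600, n_features=19, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" weights each class inversely to its frequency.
base_model = RandomForestClassifier(class_weight="balanced", random_state=0)

# Hypothetical grid, chosen only to keep the example small.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

search = GridSearchCV(base_model, param_grid, scoring="f1", cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_f1 = search.score(X_test, y_test)  # uses the "f1" scorer given above
```

After fitting, `search.best_estimator_` holds the model refit on the full training split with the selected hyperparameters.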
Experimental result and discussion
This section presents the results of the methods applied to the dataset to implement the two predictive models, and explains how the two predictive models and their results are useful for Peach flights.
This experiment evaluates the performance of the learning models. The two predictive models and the four prepared models are evaluated using the machine learning algorithms. Table 5 shows the results of the predictive model with weather data. The Gradient Boosting Classifier and Random Forest Classifier dominate, although they differ in precision and recall.
Table 6 shows the results of the predictive model without weather data. Gradient Boosting Classifier and Random Forest Classifier are superior. Gradient Boosting Classifier predominates in precision and Random Forest Classifier predominates in recall. Tables 7, 8, 9 and 10 show the results of the delay model and the cancellation model. Focusing on F-score of each model, the delay model has lower performance than the other models.
Overall, the on-time arrival predictive models achieve the best performance.
This experiment evaluates how the length of the learning period affects model performance. For the on-time arrival model, the length of the learning period is evaluated for the Gradient Boosting Classifier and Random Forest Classifier with weather data, and for the Gradient Boosting Classifier without weather data. Two periods are evaluated: from April 1, 2016 to March 31, 2017, and from March 1, 2012 to December 3, 2018. Tables 10, 11 and 12 show the evaluation results. For all of the learning models, the longer period gives better performance.
In this experiment, new flight data and weather data are used to evaluate model performance. The three models of Experiment 2 are evaluated with new flight data and weather data, extracted from Jan. 1, 2019 to Jan. 7, 2019, and from Feb. 8, 2019 to Feb. 14, 2019. The evaluation results are shown in Table 13. For the models with weather data, the accuracy and F-score of Gradient Boosting are the same as those of Random Forest in Experiment 2. However, Experiment 3 shows that the Random Forest model is better than the Gradient Boosting model on all four indexes: accuracy, precision, recall, and F-score.
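For reference, the four indexes can be computed from true and predicted labels as in the sketch below. The function name and the toy labels are illustrative only and do not reproduce any value in Table 13; the F-score here is the usual harmonic mean of precision and recall.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Toy labels: 1 = on-time, 0 = not on-time (invented for illustration).
metrics = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```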
The above experiments show that the on-time arrival model with weather data is superior to the others. Therefore, the Random Forest Classifier with weather data and the Gradient Boosting Classifier without weather data are adopted as the predictive models. When the Random Forest Classifier predicts an on-time arrival, the result is presented as a guide indicating a 77% chance of an on-time flight; otherwise, it indicates that there is some chance of a delayed or canceled flight.
Tool of on-time arrival flight prediction
A prototype web application using the on-time arrival predictive model is developed on a Tomcat server. When weather data are required by the system, they are gathered from the data resource of the Japan Meteorological Agency. First, the user inputs the flight details on the screen shown in Fig. 10. Second, the system checks whether weather data exist for the input data. Third, when the weather data exist, the Random Forest Classifier model outputs the prediction result, which is shown as a message indicating a 77% chance of an on-time flight (Fig. 11). When the weather data do not exist, the Gradient Boosting Classifier model outputs the prediction result, which is shown as a message indicating some chance of a delayed or canceled flight (Fig. 12).
An overview of the flight prediction system is shown in Fig. 13. To assist users visually, the airports on the currently operating routes are displayed so that users can select an airport. When weather data are required for the input data but are not in the system, the system accesses the data resource of the Japan Meteorological Agency and updates the weather data. When the departure date is the current day, the weather data for the hour of departure are used. When there are no weather data for that day, the newest data in the system are used for the on-time arrival prediction.
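The weather lookup and model selection described above might be organized along the following lines. This is a hedged sketch, not the tool's actual code: the function names, the model labels, the hour-keyed dictionary, and the example pressure values are all hypothetical.

```python
def select_weather(weather_by_hour, departure_key):
    """Return the record for the departure hour, else the newest stored record.

    Keys are assumed to be strings such as "2019-01-05T09", which sort
    chronologically. Returns None when no weather data are stored at all.
    """
    if departure_key in weather_by_hour:
        return weather_by_hour[departure_key]
    if weather_by_hour:
        return weather_by_hour[max(weather_by_hour)]
    return None

def choose_model(weather):
    """Pick the model with weather data when a record exists, else the one without."""
    if weather is not None:
        return "random_forest_with_weather"
    return "gradient_boosting_without_weather"

# Hypothetical store holding Yonagunijima sea-level pressure in hPa.
store = {"2019-01-05T08": {"yonagunijima_hpa": 1018.2},
         "2019-01-05T09": {"yonagunijima_hpa": 1017.9}}

exact = select_weather(store, "2019-01-05T08")
fallback = select_weather(store, "2019-01-06T10")
missing = select_weather({}, "2019-01-06T10")
```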
In this paper, a predictive model of on-time arrival flights is developed using flight data and weather features, based on the data correlation found through big data analytics. The contribution is to identify sea-level pressure pinpoint data that relate to the flight features. Thanks to this discovery, a Peach-specific weather model can be developed. Various supervised machine learning algorithms are implemented for the predictive model, and the best results are shown in Table 13. The proposed predictive model of on-time arrival flights achieves 77% accuracy, whereas the studies of arrival delay report 78% to 94.35%. Lu points out that no model so far can predict flight delays accurately, and that these models can give only a reference for prediction. An accuracy of 77% to 78% in on-time or delay prediction may be a threshold between order and chaos affected by weather. The next step is to improve the predictive model so that it can work on chaotic time series.
Information on Specified Japanese Air Carriers by the Ministry of Land, Infrastructure, Transport and Tourism. 2017. http://www.mlit.go.jp/koku/h29zigyo_bf_14.html.
Aircraft by Peach Aviation Limited. 2019. https://www.flypeach.com/pc/en/lm/ai/inflights/seatmap.
FLIGHTSTATS. 2019. https://www.flightstats.com/v2/.
Japan Meteorological Agency. 2019. https://www.jma.go.jp/jp/amedas_h/.
Etani N. Database application model and its service for drug discovery in model-driven architecture. J Big Data. 2015;2:1–17. https://doi.org/10.1186/s40537-015-0024-1.
Kim YJ, Choi S, Briceno S, Mavris D. A deep learning approach to flight delay prediction. In: Proceeding of 2016 IEEE/AIAA 35th digital avionics systems conference (DASC); 2016. p. 1–6. https://doi.org/10.1109/DASC.2016.7778092.
Choi S, Kim YJ, Briceno S, Mavris D. Prediction of weather-induced airline delays based on machine learning algorithms. In: Proceeding of 2016 IEEE/AIAA 35th digital avionics systems conference (DASC); 2016. p. 1–6. https://doi.org/10.1109/DASC.2016.7777956.
Belcastro L, Marozzo F, Talia D, Trunfio P. Using scalable data mining for predicting flight delays. ACM Trans Intell Syst Technol. 2016;8:1–20. https://doi.org/10.1145/2888402.
Thiagarajan B, Srinivasan L, Sharma AV, Sreekanthan D, Vijayaraghavan V. A machine learning approach for prediction of on time performance of flights. In: Proceeding of 2017 IEEE/AIAA 36th digital avionics systems conference (DASC); 2017. p. 1–6. https://doi.org/10.1109/DASC.2017.8102138.
Prasad US, Chauhan PA, AshaL S. Data mining & predictive analysis on airlines performance. Int J Pure Appl Math. 2018;118:1–12.
Route Map by Peach Aviation Limited. 2019. https://www.flypeach.com/pc/en/lm/st/routemap.
Kimura H, Kawashima H, Kusaka H, Kitagawa H. Applying a machine learning technique to classification of Japanese pressure patterns. Data Sci J. 2009;8:59–67. https://doi.org/10.2481/dsj.8.S59.
Madhavan A. Big Data 101. 2017. https://nnlm.gov/sites/default/files/u279/BigData101_May21.pptx.
R. 2017. https://www.r-project.org/.
Scikit-learn Machine Learning in Python. 2019. https://scikit-learn.org/stable/.
Lu Z. Alarming large scale of flight delays: an application of machine learning. 2010. p. 239–50. https://doi.org/10.5772/9142.
Buizza R. Chaos and Weather Prediction. 2002. https://www.ecmwf.int/en/elibrary/16927-chaos-and-weather-prediction.
Maathuis H, Boulogne L, Wiering M, Sterk A. Predicting chaotic time series using machine learning techniques. 2017. https://www.researchgate.net/publication/320987683_Predicting_Chaotic_Time_Series_using_Machine_Learning_Techniques
MAP OF JAPAN by Geospatial Information Authority of Japan, the Ministry of Land, Infrastructure, Transport and Tourism. 2017. http://www.gsi.go.jp/common/000102099.pdf.
Naritashi Chiba. Narita Airport Operation Status. 2017. https://www.city.narita.chiba.jp/content/000055879.pdf.
Peach Aviation Limited gave the author the opportunity and the time to conduct this research. The cost of the research was paid by the author.
The author declares that she has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Etani, N. Development of a predictive model for on-time arrival flight of airliner by discovering correlation between flight and weather data. J Big Data 6, 85 (2019). https://doi.org/10.1186/s40537-019-0251-y
- Predictive model
- On-time arrival flight of airliner
- Correlation between flight and weather data
- Pressure pattern
- Sea-level pressure
- Random Forest Classifier