# Airline new customer tier level forecasting for real-time resource allocation of a miles program

- Jose Berengueres^{1} (corresponding author)
- Dmitry Efimov^{2}

**1**:3

**DOI:** 10.1186/2196-1115-1-3

© Berengueres and Efimov; licensee Springer. 2014

**Received:** 30 October 2013

**Accepted:** 20 February 2014

**Published:** 24 June 2014

## Abstract

This is a case study on resource optimization for an airline’s miles program. The airline had a large miles loyalty program but was not taking advantage of recent data mining techniques. For example, to predict whether a new passenger would become a privileged frequent flyer in the coming month(s), the airline used a linear extrapolation of the miles earned during the past months. This information was then used in CRM interactions between the airline and the passenger. The correlation between the extrapolation and whether a new user would attain a privileged miles status was 39% when one month of data was used to make the prediction. In contrast, when GBM and other blending techniques were used, a correlation of 70% was achieved. This corresponded to a prediction accuracy of 87% with less than 3% false positives. The accuracy reached 97% when three months of data were used instead of one. We propose an application that ranks users according to their probability of joining a privileged miles tier. The application performs real-time allocation of limited resources, such as the upgrades available on a given flight. The airline can thus assign those resources to the passengers with the highest revenue potential, increasing the perceived value of the program at no extra cost.

### Keywords

Airline; Customer equity; Customer profitability; Life time value; Loyalty program; Frequent flyer; FFP; Miles program; Churn; Data mining

## Background

Several previous works have addressed the optimization of airline operations. In 2003, a model that predicts the no-show ratio in order to optimize overbooking practice claimed to increase revenues by 0.4% to 3.2% [1]. More recently, in 2013, Alaska Airlines, in cooperation with GE and Kaggle Inc., launched a $250,000 prize competition with the aim of optimizing costs by modifying flight plans [2]. However, very few works on miles programs or frequent flier programs focus on enhancing their value. One example is [3] from Taiwan, which focused on segmenting customers by extracting decision rules from questionnaires. Using a *Rough Set Approach*, the authors report classification accuracies of 95%. [4] focused on segmenting users according to return flights and length of stay using a k-means algorithm. A few other works, such as [5], focused on predicting the future value of passengers (Customer Equity). They attempt to predict the top 20% most valuable customers; using SAS software, they report false positive and false negative rates of 13% and 55% respectively. None of these works has applied this knowledge to increase the value proposition of miles programs, in spite of the fact that some of these programs have become more valuable than the airline that started them. Such is the case of Aeroplan, a spin-off of the miles program started by Air Canada. At the time of writing, Air Canada is valued at $500 M, whereas Aeroplan is valued at $2 bn, four times as much [6]. Given this, it is surprising that so many works focus on tickets or “revenue management” [7] but so few focus on miles programs. In this case study we report the experience gained from analyzing the data of an airline’s miles program. We show how, by applying modern data mining techniques, it is possible to create value for both (1) the owner of the program and (2) future high-value passengers.

1. Age and demographics.
2. Loyalty program purchases etc.
3. Flight activity, miles earned etc.

## Case Description

### Objective

After examining with the airline the opportunities that the data offered, we decided to focus on high-value passengers, with the following objective: to predict which of the new passengers who enroll in the miles program will become high-value customers, before this is obvious to a human expert.

### Rationale

1. So that potential high-value customers are identified in order to gain their loyalty sooner.
2. So that limited resources such as upgrades are optimally allocated to passengers with potential to become high-value.
3. To enhance customer profiling.

### Current method: linear extrapolation

### Definition of the target variable

*silver_attain*, and to this model as the D/S model. The past behavior of customers is then fed into the model, which is described later in the text.

### Identifying future high value customers

**Table 1 Miles program dataset**

Dataset | Fields | Rows | Cols |
---|---|---|---|
Passengers | Id, Date_of_birth, Nationality, City, State, Country, Interest_1, Interest_2, Interest_3, Interest_4, Interest_5, Tier | 1.8 M | 12 |
Flights | Id, Company, Activity_date, Origin, Destination, Class_code, Flt_number, Definition, Miles, Points | 12 M | 10 |
Activity (not used) | Id, Definition, Issue_date, Miles, Redeposited_points, Flight_date, Class_code, Origin, Destination, Flight_number, ret_flight_date, Ret_flight_number, Ret_class_code, Ret_origin, Ret_destination, Product_code, Cash_before_premium | 1.6 M | 17 |

1) Cleaning of the data.
2) Feature extraction.

### Cleaning

The dataset comprises three tables: *Passengers*, *Flights* and *Activity*. *Activity*, which contains data related to miles transactions, was not used because it did not improve the results. The airline suggested that this might be due to incompleteness of the data. For example, by airline policy all monetary transactions were unavailable to us and had been removed from the datasets. Table 1 shows the list of fields for each table. First, we removed two kinds of outlier passengers, whose future tier is either straightforward to predict or of little interest:

1) Passengers who fly less than one return trip in six years.
2) Passengers with very high activity (number of flights greater than 500).

Additionally, before the data was handed over to us, the airline anonymized the id field by replacing each id with a 32-character alphanumeric hash code. We converted these cumbersome hash ids into integer values to improve performance, because on datasets of this size R struggled to operate on hash ids during feature engineering.
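The two cleaning steps described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study. The function names, the list-of-tuples layout, and the reading of “less than one return trip” as “fewer than two flights” are assumptions made for the example.

```python
from collections import Counter

def filter_outliers(flights, min_flights=2, max_flights=500):
    """Drop passengers with too little or too much activity.

    `flights` is a list of (passenger_id, miles) tuples (hypothetical layout).
    min_flights=2 is an assumed reading of "at least one return trip".
    """
    counts = Counter(pid for pid, _ in flights)
    keep = {pid for pid, n in counts.items() if min_flights <= n <= max_flights}
    return [row for row in flights if row[0] in keep]

def hash_ids_to_integers(flights):
    """Replace each 32-character hash id with a small integer id."""
    mapping = {}
    recoded = []
    for pid, miles in flights:
        # setdefault assigns the next free integer the first time an id is seen
        int_id = mapping.setdefault(pid, len(mapping))
        recoded.append((int_id, miles))
    return recoded, mapping

flights = [
    ("a" * 32, 1200), ("a" * 32, 1200),   # one return trip: kept
    ("b" * 32, 800),                      # a single one-way flight: dropped
    ("c" * 32, 300), ("c" * 32, 300), ("c" * 32, 450),
]
cleaned = filter_outliers(flights)
recoded, id_map = hash_ids_to_integers(cleaned)
```

Integer keys are cheaper to compare, sort, and join than 32-character strings, which is the motivation given in the text for the recoding.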

### Feature extraction

Each passenger is described by a feature vector built from data in the *D* period only, where *D* is an observation window that starts at the passenger’s first flight. For example, if *D* = 15 days, then the flights and data considered to build the vector are those between start_date and start_date + *D*, where start_date is the date of the new passenger’s first flight. Note that start_date varies for every passenger. For each passenger we therefore construct a vector that depends on three variables:

1) The passenger data in the tables.
2) *start_date*.
3) The length of *D*.

Each component of the vector is called a “feature”, and the vector has 634 features. The vector is equivalent to a digital fingerprint of each passenger over a given period. Some features are straightforward to calculate; others require complex calculations. Below we explain how each feature was calculated. We divided the features into three groups: metric, categorical and cluster features.
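The per-passenger observation window can be illustrated with a short sketch. It assumes a simplified tuple layout loosely based on the *Flights* table; the function name `flights_in_window` and the toy records are hypothetical, and the study itself used R.

```python
from datetime import date, timedelta

def flights_in_window(flights, d_days):
    """Return, per passenger, the flights flown within D days of the first flight.

    `flights` is a list of (passenger_id, activity_date, miles) tuples
    (hypothetical layout loosely following the Flights table).
    """
    # start_date is each passenger's own first flight date
    start_date = {}
    for pid, day, _ in flights:
        if pid not in start_date or day < start_date[pid]:
            start_date[pid] = day

    window = {}
    for pid, day, miles in flights:
        if day <= start_date[pid] + timedelta(days=d_days):
            window.setdefault(pid, []).append((day, miles))
    return window

flights = [
    (1, date(2012, 1, 10), 900),
    (1, date(2012, 1, 20), 900),   # 10 days after the first flight: inside D
    (1, date(2012, 3, 1), 2500),   # 51 days after the first flight: outside D
    (2, date(2012, 6, 5), 400),
]
window = flights_in_window(flights, d_days=15)
# A simple metric feature computed on the window only: sum of miles.
sum_miles = {pid: sum(m for _, m in rows) for pid, rows in window.items()}
```

Because start_date is relative to each passenger, two passengers who enrolled years apart are compared over the same first *D* days of their history.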

#### Group 1 - Metric features

A metric feature is data that is already in numerical format. For example, age of a customer.

#### Group 2 - Categorical features

A categorical feature is a text variable whose content can only belong to a finite set of choices. The column “City” is one example. The main difficulty with categorical variables is that they must somehow be converted to numbers, since a computer does not understand city names per se. There are different ways to handle such variables. One way is to encode each name (also called level or category) of each categorical variable as a binary feature. We opted for this interpretation of categorical variables, known as the dummy variable method, as follows:

*Let N be the total number of values for a given categorical variable, then we create N new “dummy” features. If a given record has value = i^{th} level, then the i^{th} dummy feature equals 1, otherwise 0.*

Unfortunately, this transformation restricts the range of algorithms that are effective: algorithms based on a distance metric, such as SVM [8], yield poor results in such cases.

As an example, consider how the categorical “*city*” variable was processed (the cities to which the passenger has flown during *D*, from the *Flights* table). Each city to which the airline flies is represented as a feature in the vector. If the passenger did not fly to a given city during *D*, the feature is set to 0; if the passenger flew there one or more times, it is set to 1. The same process is applied to ticket class letters. The categorical variables that are exploded into binary format using the dummy method are:

1) Passenger Nationality.
2) City of Passenger’s address.
3) State of Passenger’s address.
4) Country of Passenger’s address.
5) Passenger’s Company (Employer).
6) Flight Origin (Airport).
7) Flight Destination (Airport).
8) Ticket Class Code (Economy E, F, K…).

**Table 2 Example of conversion from categorical to binary for a given passenger**

Feature label | Example value |
---|---|
City_1 (Amsterdam) | 0 |
City_2 (Barcelona) | 1 |
… | … |
City_323 (Zagreb) | 1 |
Class_code_1 (“A”) | 0 |
Class_code_2 (“B”) | 1 |
Class_code_3 | 1 |
Class_code_45 (“Z”) | 0 |
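The dummy variable method for the *city* variable can be sketched in a few lines. The helper name `dummy_features`, the three-city list, and the sample itinerary are hypothetical; only the encoding rule (1 if the level occurred at least once during *D*, otherwise 0) comes from the text.

```python
def dummy_features(visited, levels, prefix):
    """One binary feature per level: 1 if the level occurred at least once during D."""
    visited = set(visited)
    return {f"{prefix}_{i}": int(level in visited)
            for i, level in enumerate(levels, start=1)}

# Hypothetical subset of the destinations served by the airline.
cities = ["Amsterdam", "Barcelona", "Zagreb"]
# Destinations flown by one passenger during D (repeats count once).
flown = ["Barcelona", "Zagreb", "Barcelona"]

features = dummy_features(flown, cities, prefix="City")
```

The same helper could be applied to ticket class letters by passing the list of class codes and a different prefix.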

#### Group 3 - Cluster features

Cluster features are produced with the k-means algorithm, where *k* denotes the number of clusters. For example, k = 3 means that there are 3 clusters (A, B, C). Every passenger vector is assigned to the “closest” cluster center as defined by Euclidean distance in n dimensions, where n is the number of features of the vectors. At this point a cluster label (“A”, “B” or “C”) is assigned to each passenger vector. Then, as before, we use the dummy variable method to explode the cluster labels into binary features. This was done in the following way. For each passenger, all previous features (those of Tables 2 and 3) are put in vector form. If we are considering, say, 100 passengers and 354 flights that fall within the *D* period, we produce 100 vectors. These are input into a k-means algorithm with k = 3, which attempts to classify each of the 100 vectors into one of 3 clusters: A, B or C. Using the dummy variable method we generate three new variables: *Cluster_k3_A*, *Cluster_k3_B* and *Cluster_k3_C*. These 3 variables become additional features, replacing a single categorical feature with values “A”, “B” or “C”. If a vector (passenger) belongs to “A”, its Cluster_k3_A feature is set to 1 and the other two to 0; likewise, if the vector belongs to “B”, only Cluster_k3_B is set to 1, and so on. This process is repeated for k = 2, 5, 7, 10, 15, 20. Table 4 shows the features and an example in which a passenger vector has been classified into cluster “B” for k = 2, cluster “A” for k = 3, and so on.
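The cluster features can be sketched with a small, self-contained k-means. This is plain Lloyd's algorithm written for illustration, not the R routine used in the study; the toy vectors, the random initialization with a fixed seed, and the helper names are assumptions.

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: closest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_dummies(labels, k):
    """Explode cluster labels into binary features Cluster_k<k>_A, _B, ..."""
    letters = "ABCDEFGHIJKLMNOPQRST"
    return [{f"Cluster_k{k}_{letters[c]}": int(lab == c) for c in range(k)}
            for lab in labels]

# Toy passenger vectors (age, average miles in thousands); values are made up.
vectors = [[43, 13.5], [45, 12.9], [26, 2.9], [24, 3.1], [60, 30.0], [62, 28.5]]
features = []
for k in (2, 3):
    labels = kmeans(vectors, k)
    features.append(cluster_dummies(labels, k))
```

Each passenger ends up with exactly one feature set to 1 per value of k, so repeating the procedure for several values of k adds one group of binary columns per k.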

**Table 3 Metric features**

Feature label | Example |
---|---|
Age_of_passenger | 43 (years) |
Sum of miles | 25500 (miles) |
Average miles | 2300 |
Interest_1 | 0 or 1 |
Interest_2 | 0 or 1 |
Interest_3 | 0 or 1 |
Interest_4 | 0 or 1 |
Interest_5 | 0 or 1 |

**Table 4 Example of creation of 60 cluster features by the dummy method**

k | Example of k-means output (categorical) | Label of the new binary feature | Binary value |
---|---|---|---|
2 | “B” | Cluster_k2_A | 0 |
2 | “B” | Cluster_k2_B | 1 |
3 | “A” | Cluster_k3_A | 1 |
3 | “A” | Cluster_k3_B | 0 |
3 | “A” | Cluster_k3_C | 0 |
5 | “C” | Cluster_k5_A | 0 |
5 | “C” | Cluster_k5_B | 0 |
5 | “C” | Cluster_k5_C | 1 |
5 | “C” | Cluster_k5_D | 0 |
5 | “C” | Cluster_k5_E | 0 |
7 | “C” | Cluster_k7_A | 0 |
7 | “C” | Cluster_k7_B | 0 |
7 | “C” | Cluster_k7_C | 1 |
7 | “C” | Cluster_k7_D | 0 |
7 | “C” | Cluster_k7_E | 0 |
7 | “C” | Cluster_k7_F | 0 |
7 | “C” | Cluster_k7_G | 0 |
10 | “A” | Cluster_k10_A | 1 |
10 | “A” | … | … |
10 | “A” | Cluster_k10_J | 0 |
20 | “A” | Cluster_k20_A | 1 |
20 | “A” | … | … |
20 | “A” | Cluster_k20_T | 0 |

**Table 5 Example of feature vectors generated for each passenger**

Passenger Id | Example vector |
---|---|
1 | (43, 13.54, …, 0, 0, 12, 0, …, 0, 1, 1, 0, 0, 0, …, 0, 0, 0) |
2 | (26, 2.9, …, 0, 0, 12, 0, …, 0, 1, 1, 0, 0, 0, …, 0, 0, 0) |

#### Target variable

**Table 6 Example of feature vectors and target variables**

Passenger Id | Feature vector | Target variable |
---|---|---|
1 | (43, 13.54, …, 0, 0, 12, 0, …, 0, 1, 1, 0, 0, 0, …, 0, 0, 0) | 0 (did not achieve silver status during the S period) |
2 | (26, 2.9, …, 0, 0, 12, 0, …, 0, 1, 1, 0, 0, 0, …, 0, 0, 0) | 1 (achieved silver status during the S period) |

Now we can consider all these vectors as a matrix in which rows are passengers, columns are features, and the last column is the target variable. We use this matrix to train a mathematical model with the purpose of predicting the target variable for new passengers. Once the model is trained, to predict whether a passenger will attain silver status in a given future time frame S, we only need to generate the passenger’s feature vector by observing the passenger for a period of time D from their first flight. Once the vector is generated (naturally, without the target variable), we input it into the model, and the model outputs a number. There are no restrictions on when to ask a model for a prediction as long as the data for the given S period is available.

#### Model

The high nonlinearity of the features (meaning low correlation between the target variable and the features) restricts the number of algorithms that can predict with high accuracy. We chose to blend two algorithms which were, in our opinion, the most appropriate for this dataset: GBM (the gbm package, Generalized Boosted Regression Models) and GLM (Generalized Linear Model, glm in R). Both models are trained with the same target variable, *silver_attain*, and try to minimize the binomial deviance (log loss) of the prediction error. We chose GLM and GBM because they produce different models: GLM is a parametric model, whereas GBM can be considered a non-parametric counterpart.

### GBM

1) distribution = “bernoulli”,
2) n.trees = 2000,
3) shrinkage = 0.01,
4) interaction.depth = 0.5,
5) bag.fraction = 0.5,
6) train.fraction = 0.1,
7) n.minobsinnode = 10.

A good description of the GBM implementation we used can be found in [9]. It works as follows: the training matrix is used to train the model with the aforementioned parameters. To make a prediction for one or more passenger vectors, we input them into the trained model, and the model returns a number from 0 to 1 for each passenger vector. A 0 means that the model predicts a 0% chance that the passenger will attain silver tier status within the S period. A 1 means that the model is fully confident that the silver tier will be attained. Training the model takes about 1 hour, but asking the model to predict what will happen to 10,000 passengers takes just seconds.

### GLM

The other algorithm used is GLM, which stands for Generalized Linear Model. We used the R implementation provided by [10]. Here GLM is a simple logistic regression in which we optimize the binomial deviance, where *Error = silver_attain – prediction*. Default parameters were used, except for family, which was set to “binomial”.
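For reference, the standard form of logistic regression and of the log loss it minimizes can be written as below. The symbols are notation introduced here rather than taken from the original text: $x_i$ is the feature vector of passenger $i$, $y_i \in \{0, 1\}$ is the value of *silver_attain*, $\beta_0$ and $\beta$ are the fitted coefficients, and $n$ is the number of passengers.

$$
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x_i)}}, \qquad
\mathrm{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

For 0/1 targets the binomial deviance equals $2n \cdot \mathrm{LogLoss}$, so minimizing one is equivalent to minimizing the other.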

### Blending with grid search

Combining predictions is known to improve accuracy if certain conditions are met [11]. In general, the less correlation between the individual predictors, the higher the gain in accuracy. A way that usually lowers cross-correlation is to combine models of different “nature”. We chose to combine a decision tree based algorithm (GBM) and regression based one (GLM). For a 3/3 model the relative gain in accuracy due to blending was 3%. The final prediction was constructed as a linear combination of the output of the two. Grid search was used to find which linear combination was optimal on the training set. The optimal combination was 90% of the GBM prediction plus 10% of the GLM prediction.
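A grid search over blending weights can be sketched as follows. The predictions are made up, the step of 0.1 and the choice of log loss as the criterion are assumptions, and the function names are hypothetical; the study reports only that the optimal mix on its training set was 90% GBM and 10% GLM.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binomial log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_blend(y_true, p_gbm, p_glm, steps=10):
    """Grid search over w in {0, 0.1, ..., 1} for w * p_gbm + (1 - w) * p_glm."""
    best_w, best_loss = None, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(p_gbm, p_glm)]
        loss = log_loss(y_true, blended)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Made-up predictions from two models for six passengers.
y     = [0, 0, 0, 1, 0, 1]
p_gbm = [0.05, 0.10, 0.20, 0.90, 0.02, 0.60]
p_glm = [0.20, 0.05, 0.10, 0.70, 0.10, 0.80]
w, loss = best_blend(y, p_gbm, p_glm)
```

Because the grid includes w = 0 and w = 1, the blended loss on the data used for the search can never be worse than that of either model alone; whether the gain carries over to new passengers has to be checked on held-out data.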

### Loss function

One characteristic of the target variable (*silver_attain*) was the huge number of 0’s and the small number of 1’s. This is hardly surprising: after all, by design, only a small percentage of all passengers are supposed to attain a privileged tier. Another interesting fact is that the airline wanted to identify future silver and gold passengers with the highest possible accuracy, that is, with a low false positive rate (mistakes). However, since the penalty for false negatives is not high, the acceptable precision level can be lower. These two observations enabled us to modify the usual definitions of Precision and Accuracy in the following way:

where P is precision, A is accuracy, N_{0} is the number of true positives (users predicted to become silver who later did become silver), N_{1} is the number of false positives (users predicted to become silver who did not become silver after S days) and N_{2} is the number of false negatives (users predicted not to become silver who did become silver). In the next section we show the error calculated with the usual Precision and Recall and the error calculated according to the chosen loss functions (Eq. 1).
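The displayed equation does not appear in this copy of the text. A form that would be consistent with the definitions above, with the link drawn between accuracy and a low false positive rate, and with the counts reported in Table 8 is the following; it should be read as a reconstruction under those assumptions, not as the authors' exact notation.

$$
A = \frac{N_0}{N_0 + N_1}, \qquad P = \frac{N_0}{N_0 + N_2}
$$

As a check, with the Table 8 counts $N_0 = 1309$, $N_1 = 37$ and $N_2 = 293$:

$$
A = \frac{1309}{1309 + 37} \approx 0.97, \qquad P = \frac{1309}{1309 + 293} \approx 0.82
$$

These values agree with the 97% and 82% reported in Table 7 for D = 3 and S = 1.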

### Discussion and evaluation

It takes about 18 hours on a laptop to clean the data and construct the features. To build a D/S model takes about 1 hour per each D/S combination. Once a model is pre-calculated, making a prediction for one single passenger takes less than 350 ms (similar to a Google search).

### Compared performance

The *P* and *A* values of Table 7 were calculated according to Eq. 1. Table 8 shows actual numbers for one particular example, D/S = 3/1 months. Table 8 also shows why calculating Precision and Accuracy in the usual way is not suitable for assessing model performance on this dataset: the huge number of 0’s dominates the result. A trade-off between Accuracy, Precision, D and S clearly exists, and these parameters can be adjusted to suit various forecasting needs. Additionally, Table 7 shows that accuracy decreases, roughly speaking, as S grows: the longer the S time span, the lower A and P are. This came as a bit of a surprise to us, as we expected the averaging effect of a longer S time span to facilitate prediction, but in fact the opposite is true: shorter, more limited time spans lead to more accurate predictions. In hindsight, as is the case with weather forecasts, it is easier to predict events that are close in the future than those that are farther away.

**Table 7 Comparison of prediction power of extrapolation vs. D/S model**

Question asked D months after the first flight: will the passenger become Silver within S months?

Case | D (months) | S (months) | Extrapolation of miles Rx | D/S model Rx | A | P |
---|---|---|---|---|---|---|
0 | 0.5 | 3 | 0.39 | 0.60 | 81% | 31% |
1 | 1 | 1 | 0.39 | 0.71 | 87% | 53% |
2 | 1 | 2 | 0.39 | 0.70 | 89% | 48% |
3 | 3 | 1 | 0.50 | 0.89 | 97% | 82% |
4 | 3 | 3 | 0.39 | 0.67 | 95% | 46% |
5 | 6 | 3 | 0.51 | 0.83 | 96% | 69% |
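The Rx columns compare a score with the 0/1 outcome. A minimal sketch of how such a correlation could be computed for the extrapolation baseline is given below. The proportional extrapolation rule, the toy numbers, and the use of the Pearson coefficient for Rx are all assumptions, since the text does not spell them out.

```python
import math

def extrapolate_miles(miles_in_d, d_months, s_months):
    """Project miles earned during D linearly to the end of the S period."""
    return miles_in_d * (d_months + s_months) / d_months

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up miles earned during D = 1 month and the later 0/1 outcome.
miles_in_d    = [500, 1200, 8000, 300, 6500, 900]
silver_attain = [0, 0, 1, 0, 1, 0]
projected = [extrapolate_miles(m, d_months=1, s_months=2) for m in miles_in_d]
rx = pearson(projected, silver_attain)
```

Note that a purely proportional projection rescales every passenger by the same factor, so its correlation with the outcome is the same as that of the raw miles earned during D.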

**Table 8 Precision and accuracy of silver attainment**

Will a user become Silver in 1 month?

Predicted (with 3 months of data) | What really happened | Is the prediction correct? | Number of cases | Percent of total |
---|---|---|---|---|
No | No | Yes | 54752 | 97.09 |
Yes | Yes | Yes | 1309 | 2.32 |
No | Yes | No | 293 | 0.52 |
Yes | No | No | 37 | 0.07 |
Total | | | 56391 | 100 |
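The effect of the many 0’s can be made concrete with the counts in the table above. In the sketch below, the forms used for A (true positives over all positive predictions) and P (true positives over all actual positives) are assumptions chosen because they reproduce the 97% and 82% reported for D = 3, S = 1; the function name is hypothetical.

```python
def evaluate(tp, fp, fn, tn):
    """Return the usual accuracy plus A and P computed from confusion counts."""
    total = tp + fp + fn + tn
    return {
        # Usual accuracy: dominated by the many true negatives.
        "usual_accuracy": (tp + tn) / total,
        # Assumed form of A: share of predicted-silver passengers who became silver.
        "A": tp / (tp + fp),
        # Assumed form of P: share of actual silver passengers that were predicted.
        "P": tp / (tp + fn),
        # Baseline that always predicts "No".
        "always_no_accuracy": (tn + fp) / total,
    }

# Counts from the table above (3 months of data, silver within 1 month).
scores = evaluate(tp=1309, fp=37, fn=293, tn=54752)
```

A model that always answers “No” would already reach a usual accuracy of about 97% on these counts, which is why metrics that ignore the true negatives are more informative here.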

### Confirmation of no data leakage

Given the remarkably high accuracy rates obtained, the airline showed a healthy concern that the predictions might be wrong due to data leakage. Data leakage happens when the training data somehow contains information about what is to be predicted (*target_variable*). The only way to prove beyond doubt that there is no data leakage is to make predictions about the future (about data that does not exist at the time of the prediction). To address this valid concern, the model was used to predict what passengers would do in the future:

1) Which of the 49572 customers who had enrolled during the last quarter of 2012 would attain silver in the first 30 days of 2013 (a D = 3/S = 1 model), and
2) Which of the 7890 customers who had enrolled during the last two weeks of December 2012 would attain silver during the first 90 days of 2013 (a D = 0.5/S = 3 model).

## Discussion

By translating passenger data from the “Airline” timeline to a timeline relative to each passenger’s first flight, we have shown that a D/S model yields high accuracies. Furthermore, by taking advantage of recently released data mining libraries [9, 10], we outperformed simple extrapolation models and previous works [5]. False positive rates are below 3%. The causes of false positives were not investigated within the scope of this project, but they may be due to either or both of the following: (1) the predictive power of the data is limited; (2) the predictive power of the model can be improved. However, the most interesting result is that the perceived value of a miles program can be increased dramatically for the very customers that matter most to the airline: those with a high likelihood of becoming Silver.

In our experience with previous data mining projects, the most effective way to improve accuracy is not to fine-tune models but to add new features that are as uncorrelated as possible with the existing ones. A good place to look for candidates is data sources other than the airline CRM database, for example publicly available social media data.

## Declarations

### Acknowledgments

We thank Sajat Kamal for the Aeroplan insights, and Dr. Barry Green and Roy Kinnear for showing themselves to be greater than their prejudices and letting us model the data.

## Authors’ Affiliations

## References

1. Lawrence RD, Hong SJ, Cherrier J: **Passenger-based predictive modeling of airline no-show rates.** In *Proceedings of the ninth SIGKDD international conference on Knowledge discovery and data mining*. 2003, 397–406.
2. *Alaska Airlines GE Flight Quest data mining challenge*. 2013. http://www.gequest.com/c/flight
3. Liou JJH, Tzeng G-H: **A dominance-based rough set approach to customer behavior in the airline market.** *Inf Sci* 2010, **180**(11):2230–2238. 10.1016/j.ins.2010.01.025
4. Pritscher L, Feyen H: **Data mining and strategic marketing in the airline industry.** In *Data Mining for Marketing Applications*. Citeseer; 2001. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.7062
5. Malthouse EC, Blattberg RC: **Can we predict customer lifetime value?** In *Journal of Interactive Marketing*. Wiley Online Library; 2005. http://www.researchgate.net/publication/227633642_Can_we_predict_customer_lifetime_value/file/60b7d517fd5bb2905e.pdf
6. *Aeroplan loyalty program*. http://en.wikipedia.org/wiki/Aeroplan
7. Smith BC, Leimkuhler JF, Darrow RM: **Yield management at American airlines.** *Interfaces* 1992, **22**(1):8–31. 10.1287/inte.22.1.8
8. Burges CJ: **A tutorial on support vector machines for pattern recognition.** *Data Min Knowl Disc* 1998, **2**(2):121–167. 10.1023/A:1009715923555
9. Ridgeway G, with contributions from others: *gbm: Generalized Boosted Regression Models. R package version 2.0–8*. 2013. http://CRAN.R-project.org/package=gbm
10. Marschner I: *glm2: Fitting Generalized Linear Models. R package version 1.1.1*. 2012. http://CRAN.R-project.org/package=glm2
11. Valentini G, Masulli F: **Ensembles of learning machines.** In *Neural Nets*. Berlin Heidelberg: Springer; 2002:3–20.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.