The aim of the LSTM model is to automate credit card behaviour scoring for customers and to trigger an early alert for credit card default. The framework of the proposed model is presented in Fig. 5. This workflow lets us investigate the model's performance thoroughly and draw reliable conclusions.
The proposed framework consists of several steps. First, the dataset is preprocessed and formatted for use by the bidirectional LSTM classifier. Next, fivefold cross-validation is used to obtain predictions for all customers in the dataset. The performance measures are then calculated for different groups of customers that are of financial interest to banks (banks are especially interested in customers with an unsatisfactory payment history). To assess the model, it is compared with benchmark models using various performance measures. The results are discussed in the final section.
Dataset
There are only a few open-source transactional datasets that can be used to test the efficiency of the proposed model. The majority of datasets are either nontemporal or come from a different field of research. To verify the practicality and effectiveness of the proposed LSTM model, we use a public^{Footnote 1} real credit card dataset that was used in Bahdanau et al. [52] and can be easily converted to temporal form.
Dataset description
The dataset used in this paper is a public nontransactional credit card dataset that reflects customers’ default payments in Taiwan [54]. It has been widely used in validating credit and behavioural scoring models [55,56,57], as well as deep learning models [58, 59]. Banks usually do not disclose transactional databases in raw form, so the majority of open-access datasets are in processed form. We therefore used this dataset because it is the only publicly available dataset that can be converted into temporal form (customer payment statistics for each month rather than aggregated values).
The dataset contains 30,000 records, which is large enough to test the efficiency of the proposed model. The number of nondefault payments is 23,364, while the number of default payments is 6,636 (the proportion of default payments in the dataset is 22%). There are no missing values in the dataset.
In the dataset, the following 23 variables are used as explanatory variables:

(1)
X1: Amount of the given credit, which includes both the individual consumer credit and his/her family (supplementary) credit.

(2)
X2: Gender (1 = male; 2 = female).

(3)
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

(4)
X4: Marital status (1 = married; 2 = single; 3 = others).

(5)
X5: Age (year).

(6)
X6–X11: History of past payment. The tracked monthly payment records, from September 2005 back to April 2005, are denoted by X6–X11, respectively. The measurement scale for the repayment status is: −1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months; ...; 8 = payment delay for 8 months; 9 = payment delay for 9 months and above.

(7)
X12–X17: Amount of bill statement. The monthly bill statement amounts, from September 2005 back to April 2005, are denoted by X12–X17, respectively.

(8)
X18–X23: Amount of previous payment (NT dollar). X18 = amount paid in September 2005; X19 = amount paid in August 2005; ...; X23 = amount paid in April 2005.
The variables can be divided into two groups: numerical and categorical. Examples of the first group are X1 (amount of given credit), X5 (age) and X6–X11 (history of past payment), among others. The second group contains X2 (gender), X3 (education) and X4 (marital status).
Dataset preprocessing and partitioning
Before being fed into the neural network, the dataset was split into two parts: temporal data and nontemporal data. Columns X6–X23, the temporal data that reflect customer behaviour over time, were reshaped into a three-dimensional array of shape (number of customers, number of months, number of features). According to the dataset description, for each customer we have information about their payment behaviour during the six previous months. Therefore, the second dimension of the array is equal to six. The number of temporal features available for each customer is three, namely:

(1)
Payment delay by the end of each past month;

(2)
Amount of bill statement by the end of each past month;

(3)
Amount of the payment in each month.
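The reshaping described above can be sketched in plain Python. This is only an illustration under stated assumptions: each customer is taken to be a flat row ordered X1 to X23, the helper names `to_temporal` and `reshape_dataset` are hypothetical, and ordering the months from oldest (April) to most recent (September) is an assumption rather than a detail given in the text.

```python
N_MONTHS = 6  # six months of history per customer


def to_temporal(row):
    """Turn one flat customer row [X1, ..., X23] into a 6 x 3 sequence.

    Each time step holds [payment delay, bill amount, payment amount].
    """
    if len(row) != 23:
        raise ValueError("expected 23 explanatory variables per customer")
    delay = row[5:11]   # X6..X11, listed from September back to April
    bill = row[11:17]   # X12..X17, same month order
    paid = row[17:23]   # X18..X23, same month order
    steps = [[delay[m], bill[m], paid[m]] for m in range(N_MONTHS)]
    steps.reverse()     # assumption: put the oldest month (April) first
    return steps


def reshape_dataset(rows):
    """Result shape: (number of customers, number of months, number of features)."""
    return [to_temporal(row) for row in rows]


# Toy row whose value at position k is k, so X6 == 6, ..., X23 == 23.
toy = list(range(1, 24))
sequence = reshape_dataset([toy])[0]
print(sequence[0])   # April step:     [11, 17, 23]
print(sequence[-1])  # September step: [6, 12, 18]
```

For the full dataset the result would have shape (30,000, 6, 3).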
Nontemporal categorical data were split into binary variables; thus, for each customer there are eight nontemporal features:

(1)
Amount of given credit.

(2)
Gender.

(3)
Education—graduate school.

(4)
Education—university.

(5)
Education—high school.

(6)
Education—others.

(7)
Marital status.

(8)
Age.
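A minimal sketch of how the eight nontemporal features listed above could be built from X1 to X5 is shown below. The function name `nontemporal_features` is hypothetical, and the handling of education codes outside the four documented levels (all four indicators set to zero) is an assumption of this sketch.

```python
EDUCATION_LEVELS = (1, 2, 3, 4)  # graduate school, university, high school, others


def nontemporal_features(row):
    """Build the eight nontemporal features from X1..X5 of one customer row."""
    credit, gender, education, marital, age = row[:5]
    # One binary indicator per education level; an undocumented code
    # simply produces four zeros in this sketch.
    education_flags = [int(education == level) for level in EDUCATION_LEVELS]
    return [credit, gender] + education_flags + [marital, age]


example = nontemporal_features([20000, 2, 2, 1, 24])
print(example)  # [20000, 2, 0, 1, 0, 0, 1, 24]
```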
To properly test the performance of the model, we use fivefold cross-validation as the partitioning technique. All customers were randomly split into five groups, and each group serves as the test set in exactly one fold.
As can be seen, most of the information in the dataset is stored in the temporal features of past credit card activity and payments. The nontemporal features, on the other hand, are too general and are, in fact, categorical features. That is why, without the temporal features, it is impossible to predict the probability of a future missed payment.
An attention mechanism is used to provide context: age and gender supply such context for the temporal financial information. This means that similar payment behaviour by young and old customers can lead to different payment outcomes (e.g., young customers may forget or skip a payment in some month and so have a bad payment history, but they would pay eventually).
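The partitioning can be illustrated with a short standard-library sketch. The generator name `five_fold_indices` and the fixed seed are hypothetical; the split is purely random, as described in the text, with no stratification by class.

```python
import random


def five_fold_indices(n_customers, n_folds=5, seed=0):
    """Yield (train, test) index lists; every customer is tested exactly once."""
    indices = list(range(n_customers))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of near-equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


for train, test in five_fold_indices(30000):
    print(len(train), len(test))  # 24000 6000 on each of the five folds
```

Collecting the test-set predictions from all five folds gives one out-of-sample prediction per customer.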
Benchmark models development
To measure how well the proposed approach performs, its results are compared with those of five benchmark models, namely GB, BNN, RF, SVM and LOGR. LOGR is the industry standard for developing credit scoring models [60, 61]. Moreover, [61] has stated that it is beneficial to compare a new method with the standard one as well as with other established techniques. MLP, RF, and SVM have been used in several studies as benchmark models [62]. The theoretical backgrounds of the models are described in the following sections.
Gradient boosting
Gradient Boosting (GB) machines are a group of powerful machine learning techniques that have demonstrated impressive results in a wide range of practical applications. They are highly customizable to the particular needs of the application, for example by being trained with respect to different loss functions. The fundamental idea of boosting is to add new models to the ensemble sequentially. At each iteration, a new weak base-learner model is trained with respect to the error of the full ensemble learnt up to the previous iteration [63].
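The sequential idea can be illustrated with a toy sketch. It is not the benchmark configuration used in the paper: it assumes a single numeric input, squared-error loss (so the error each new learner fits is simply the current residual), decision stumps as weak learners, and hypothetical names `fit_stump` and `gradient_boost`.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the ensemble's current error."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for x, p in zip(xs, preds)]

    def predict(x):
        return base + sum(learning_rate * (l if x <= t else r)
                          for t, l, r in stumps)

    return predict


model = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(model(1) < 0.01, model(6) > 0.99)  # True True
```

With a classification loss, as would be typical for default prediction, the residual would be replaced by the negative gradient of that loss.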
Bagging neural network
Neural Networks (NN) are machine learning frameworks inspired by the structure of the biological neuron [64]. They have been shown to mimic the human brain's ability to discover complex relationships between inputs and outputs [65]. One of the best-known NN designs is the multilayer perceptron (MLP), which comprises one input layer, at least one hidden layer, and one output layer. According to [66], the central issues to be addressed in building NNs are their topology, structure, and learning algorithm. The most widely used MLP topology for credit scoring is the three-layer feed-forward back-propagation network. Consider the input of a credit scoring training set \(x = \left\{ {x_{1} , x_{2} , \ldots , x_{n} } \right\}\); the MLP model works in one direction, starting from feeding the data \(x\) to the input layer (\(x\) includes the customer’s attributes or characteristics). These inputs are then sent to a hidden layer through links, or synapses, each associated with a random initial weight. The hidden layer processes what it has received from the input layer and applies an activation function to it. The result is passed as a weighted input to the output layer, which further processes the weighted inputs and applies the activation function, leading to a final decision [67]. Because ensemble models have become more popular in recent years, a Bagging NN consisting of 10 neural networks is used instead of a single NN.
Support vector machines
An SVM is another powerful machine learning method used in classification and credit scoring problems. SVMs are used for binary classification to find the best separation of the input data into two classes (good and bad credit). SVMs were first proposed by Cortes and Vapnik [68], adapting the form of a linear classifier. The primary distinction between the SVM model and a linear one is the presence of a function that maps the data into a higher-dimensional space. To achieve this, linear, polynomial, radial basis, and sigmoid kernel functions have been suggested. An SVM maps nonlinear data of two classes to a high-dimensional feature space, where a linear model is then used to separate the classes. The linear model in the new feature space corresponds to a nonlinear decision boundary in the original space. Consequently, the SVM builds an optimal line or hyperplane that separates the two classes in that space. SVMs are widely used in credit scoring and other fields owing to the method’s strong results [69, 70].
Random forests
A random forest (RF), as proposed by Breiman [71], is a decision tree (DT) technique that consists of a large number of trees. The trees are created by generating n subsets from the core dataset, with one tree built on each subset using randomly selected variables, hence the name “random forest”. After all the DTs have been generated and trained, the final decision class is determined by voting: the class chosen by the largest number of trees is selected as the final output class of the RF.
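The two sources of randomness and the voting step can be sketched as follows. Tree training itself is omitted, the helper names are hypothetical, and drawing one variable subset per tree is a simplification made for this sketch (Breiman's method draws candidate variables at each split).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one training subset per tree."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices at random (simplified, per tree)."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
subset = bootstrap_sample(list(range(10)), rng)   # 10 draws, repeats allowed
features = random_feature_subset(23, 5, rng)      # 5 of the 23 variables
print(majority_vote([1, 0, 1, 1, 0]))             # 1
```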
Logistic regression
Logistic Regression (LOGR) has until now been considered the industry standard for credit scoring model development [68]. It is a widely used statistical technique that is popular for solving classification and regression problems. LOGR is used to model a binary outcome variable, usually coded as 0 or 1 (good and bad loans). The LOGR formula is given in Atiya and Parlos [19].
Performance measure metrics
To validate the proposed model and to reach a reliable conclusion on its predictive accuracy, five performance measures are used: (1) accuracy, (2) Area Under the Curve (AUC), (3) H-measure, (4) Kolmogorov–Smirnov (KS) chart, and (5) Brier score. These are chosen because they are popular in credit scoring and together give a comprehensive view of all facets of model performance. The accuracy is the proportion of correctly classified good and bad loans, which measures the predictive power of the model. As such, it is a standard measure of the discriminating ability of the model [68]. The accuracy can be defined as the percentage of correctly classified instances
$$ \frac{TP + TN}{{TP + TN + FP + FN}}, $$
where TP, FN, FP and TN represent the number of true positives, false negatives, false positives and true negatives, respectively.
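As a small worked example of the accuracy formula, the sketch below counts the four outcomes and applies it. The function names are hypothetical, and label 1 is assumed to denote the positive (default) class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (2, 2, 0, 1)
print(accuracy(y_true, y_pred))          # 0.8
```

With the class counts reported for this dataset, a classifier that always predicts nondefault would already reach 23,364 / 30,000, or about 0.779, which is one reason accuracy is complemented by the other measures.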
AUC is a tool used in binary classification analysis to determine which of the models predicts the classes best. According to Hand [72], the AUC can be used to estimate a model’s performance without any prior information about the error costs. However, it implicitly assumes different cost distributions for different classifiers, depending on their actual score distributions, which prevents the classifiers from being compared effectively. As a result, Hand [72] proposed the H-measure as an alternative to AUC for measuring classification performance; it assumes the same cost distribution for all classifiers, independent of their scores. In other words, this measure uses a single threshold distribution for all classifiers. AUC is evaluated as the area under the ROC curve of the classifier being measured.
The KS statistic was originally formulated as a hypothesis test for assessing the fit of a distribution to data. In binary classification problems, it has been used as a divergence metric for assessing the classifier’s discriminant power by measuring the distance that its score produces between the cumulative distribution functions of the two data classes [73].
Lastly, the Brier score, which is also known as the mean squared error [74], measures the accuracy of the classifier’s probability predictions by taking the mean squared error between the predicted probabilities and the actual outcomes. In other words, it shows the average squared size of a forecasting error. The main difference between the Brier score and accuracy is that the Brier score takes the probabilities directly into account, while accuracy transforms these probabilities into zero or one based on a predetermined threshold or cutoff score. The lower the Brier score, the better the classifier performance. The most common formulation of the Brier score is:
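The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes AUC through this pairwise form; the function name `auc` is hypothetical, label 1 is assumed to be the positive class, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```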
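A minimal sketch of the KS measure as used for classifiers is given below: it takes the largest vertical gap between the empirical cumulative distribution functions of the scores of the two classes. The function name `ks_statistic` is hypothetical and label 1 is assumed to be the positive class.

```python
from bisect import bisect_right


def ks_statistic(y_true, scores):
    """Maximum distance between the score CDFs of the two classes."""
    pos = sorted(s for t, s in zip(y_true, scores) if t == 1)
    neg = sorted(s for t, s in zip(y_true, scores) if t == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        # Share of each class with a score at or below the threshold.
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


print(ks_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.5
print(ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0
```

A value of 1.0 means the scores separate the two classes completely, while a value near 0 means the two score distributions are almost identical.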
$$ BS = \frac{1}{N}\mathop \sum \limits_{t = 1}^{N} \left( {f_{t} - \sigma_{t} } \right)^{2} $$
in which \(f_{t}\) is the probability that was forecast, \(\sigma_{t}\) the actual outcome of the event at instance t (zero if it does not happen and one if it does happen) and N is the number of forecasting instances.
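The Brier score, the mean of the squared differences between each forecast probability and the corresponding 0/1 outcome, can be computed directly. The function name `brier_score` and the toy values are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Squared errors are 0.01, 0.01, 0.04 and 0.09, so the score is about 0.0375.
print(round(brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0]), 4))
```

An uninformative forecast of 0.5 for every case scores 0.25, which gives a convenient reference point when reading reported values.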
To check whether a model’s behavioural score can be considered as the likelihood of a missed payment, calibration curves are used. Also known as reliability diagrams, they can be applied to classifiers that predict a class and output a probability for it. Reliability diagrams offer a diagnostic to check whether the scores are trustworthy: a prediction is considered trustworthy if the event happens with an observed relative frequency consistent with the forecast value [75]. A calibration curve is built by sorting the output scores of the classifier. In particular, the forecasts are apportioned into a fixed number of bins along the x-axis. The class labels are then counted for each bin (i.e., the relative observed frequency). Finally, the counts are normalized, and the results are plotted as a line plot. If the classifier is forecasting accurately, the percentage of dominant-class labels and the mean probability assigned to the dominant class in each bin are expected to be close to one another. If it is not, these two values diverge. The positions of the points on the curve relative to the diagonal help to interpret the forecasts, for example:

(1)
Below the diagonal: the model has overforecast; the probabilities are too large.

(2)
Above the diagonal: the model has underforecast; the probabilities are too small.
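The binning procedure behind a calibration curve can be sketched as follows. Equal-width bins on [0, 1], the default of ten bins, and the function name `calibration_curve` are assumptions of this sketch.

```python
def calibration_curve(outcomes, forecasts, n_bins=10):
    """Return (mean forecast, observed frequency) for every non-empty bin."""
    counts = [0] * n_bins
    forecast_sums = [0.0] * n_bins
    event_counts = [0] * n_bins
    for outcome, forecast in zip(outcomes, forecasts):
        # Equal-width bins on [0, 1]; a forecast of 1.0 goes in the last bin.
        b = min(int(forecast * n_bins), n_bins - 1)
        counts[b] += 1
        forecast_sums[b] += forecast
        event_counts[b] += outcome
    return [(forecast_sums[b] / counts[b], event_counts[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]


curve = calibration_curve([0, 0, 1, 1, 1], [0.1, 0.15, 0.8, 0.85, 0.9], n_bins=2)
for mean_forecast, observed in curve:
    side = "below" if observed < mean_forecast else "on or above"
    print(round(mean_forecast, 3), observed, side, "the diagonal")
```

In this toy case the first bin lies below the diagonal (mean forecast about 0.125, observed frequency 0), so those forecasts are too large, while the second bin lies above it (mean forecast about 0.85, observed frequency 1), so those forecasts are too small.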
Statistical significance tests
As indicated by Witten et al. [76], it is not sufficient to show that one model achieves better results than another, because the outcome may depend on the performance measures or splitting techniques used. For a complete performance evaluation, it is appropriate to carry out hypothesis testing to confirm that the experimental differences in performance are statistically significant and not just due to random splitting effects. Selecting the right test depends on factors such as the number of datasets and the number of classifiers to be compared.
According to Demšar [77], statistical tests can be parametric (e.g., paired t-test) or nonparametric (e.g., Wilcoxon, McNemar). However, the author recommended nonparametric tests over parametric tests, as the latter can be conceptually unsuitable and statistically unsafe. Nonparametric tests may be more applicable and safer than parametric tests since they do not presume normality of the data or homogeneity of variance [77]. Accordingly, in this study, the McNemar test is adopted to compare the performance of all the models measured on a single dataset [78]. According to Kavzoglu [79], the McNemar test investigates the statistical significance of the differences in classifiers’ performances. The test is a chi-square (χ^{2}) test for goodness of fit, comparing the distribution of counts expected under the null hypothesis to the observed counts. It is applied to a 2 × 2 contingency table, the cells of which contain the number of cases correctly and incorrectly classified by both models and the number of cases classified correctly by only one model.
The aim of the McNemar test is to check the null hypothesis, which states that neither of the two models performs better than the other. The alternative hypothesis asserts that the performances of the two models are not equal. The McNemar statistic is given in Eq. (11):
$$ \chi^{2} = \frac{{\left( {\left| {n_{ij} - n_{ji} } \right| - 1} \right)^{2} }}{{n_{ij} + n_{ji} }} $$
(11)
where \(n_{ij}\) indicates the number of cases misclassified by model \(i\) but classified correctly by model \(j\), and \(n_{ji}\) indicates the number of cases misclassified by model \(j\) but not by model \(i\).
The computed statistic is treated as a value from the \(\chi^{2}\) distribution with 1 degree of freedom. Based on this assumption, the p-value is calculated. If this p-value is smaller than the predefined significance level α, then we reject the null hypothesis and accept the alternative hypothesis. Otherwise, we fail to reject the null hypothesis. For example, if the value of the test statistic is greater than 3.84, then (according to the \(\chi^{2}\) table at the 95% confidence level) it can be stated that the two methods differ in their performances. In other words, the difference in performance between the methods \(i\) and \(j\) is said to be statistically significant [78, 79].
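A minimal sketch of the test is shown below, using the continuity-corrected form of the statistic, (|n_ij − n_ji| − 1)² / (n_ij + n_ji). The function names and toy predictions are hypothetical. The p-value uses the identity that, for 1 degree of freedom, the upper-tail probability of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of the chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def mcnemar(y_true, pred_i, pred_j):
    """Return (statistic, p_value) for the continuity-corrected McNemar test."""
    cases = list(zip(y_true, pred_i, pred_j))
    # n_ij: misclassified by model i but classified correctly by model j.
    n_ij = sum(1 for t, a, b in cases if a != t and b == t)
    # n_ji: misclassified by model j but classified correctly by model i.
    n_ji = sum(1 for t, a, b in cases if b != t and a == t)
    if n_ij + n_ji == 0:
        return 0.0, 1.0  # the models never disagree: no evidence of a difference
    statistic = (abs(n_ij - n_ji) - 1) ** 2 / (n_ij + n_ji)
    return statistic, chi2_sf_1df(statistic)


# Toy data: model i errs on 25 cases that j gets right; j errs on 5 that i gets right.
y_true = [1] * 40
pred_i = [0] * 25 + [1] * 15
pred_j = [1] * 25 + [0] * 5 + [1] * 10
statistic, p_value = mcnemar(y_true, pred_i, pred_j)
print(round(statistic, 3))  # 12.033
print(p_value < 0.05)       # True: reject the null hypothesis at alpha = 0.05
```

Here the statistic exceeds 3.84, so the difference between the two toy models would be declared statistically significant at the 5% level.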