Enhanced credit card fraud detection based on attention mechanism and LSTM deep model

As credit cards have become the most popular payment method, particularly in the online sector, fraudulent activity involving credit card payment technologies is increasing rapidly. Financial institutions must therefore continuously improve their fraud detection systems to reduce heavy losses. The purpose of this paper is to develop a novel system for credit card fraud detection based on sequential modeling of data, using an attention mechanism and LSTM deep recurrent neural networks. Compared to previous studies, the proposed model considers the sequential nature of transactional data and allows the classifier to identify the transactions in the input sequence that are most important for predicting fraud accurately. Specifically, the robustness of our model comes from combining the strengths of three sub-methods: uniform manifold approximation and projection (UMAP) for selecting the most useful predictive features, Long Short Term Memory (LSTM) networks for incorporating transaction sequences, and the attention mechanism for enhancing LSTM performance. Experiments with our model give strong results in terms of efficiency and effectiveness.

2. Machine learning models that are never updated are inadequate because they do not account for changes and trends in customer spending behavior, for example during holiday periods or across geographical regions.
In such situations, financial institutions should continuously maintain an increasingly sophisticated fraud detection system (FDS) to mitigate the prevailing menace of fraud and detect it immediately, with the objective of preventing fraud before it occurs, protecting consumers' interests and reducing the heavy annual financial losses caused by fraud around the world [4][5][6][7][8].
In this paper, we propose a novel credit card fraud detection system based on Long Short Term Memory (LSTM) networks and an attention mechanism. The attention mechanism allows a sequence-based neural network to automatically focus on the data items that are most important to the classification task, through a data-driven weighted average of the local information contained in each term of the sequence, which results in improved detection performance. The main contributions of our proposed fraud detection method are:
1. Optimizing the process of learning classifiers by using feature selection and dimension reduction algorithms such as PCA, t-SNE and UMAP.
2. Overcoming the issue of the imbalanced dataset and improving the learning process by using the Synthetic Minority Oversampling Technique (SMOTE).
3. Constructing the context of the consumer's spending behavior by using the sequence learner LSTM recurrent neural networks as a dynamic pattern recognition classifier that models long term dependencies within transaction sequences.
4. Applying the attention mechanism on top of the LSTM recurrent networks, which allows the classifier to learn where to pay selective attention in the input sequence for the global fraud decision and delivers good performance.
5. Performing experiments on two different datasets, from which we conclude that our method is a competitive alternative to existing LSTM works.
This work opens perspectives for dealing with sequential data in the fraud detection area. In order to ensure reproducibility, the source code and results of the proposed model can be found at https://github.com/bibtissam/LSTM-Attention-FraudDetection. The rest of the paper is organized as follows: "Related works" section reviews prior work in the credit card fraud detection domain, "Background" section introduces the techniques on which our work is based, "Methods and materials" section presents the structure of our proposed model and describes the datasets used in this study, and "Results and discussion" section discusses the results obtained. Finally, "Conclusion" section concludes the paper and suggests ideas for future research.

Related works
A wide range of machine learning approaches based on supervised learning, unsupervised learning, anomaly detection and ensemble learning have been used in payment card fraud detection [9]. In particular, supervised classification techniques have proved extremely effective for this challenge: pre-classified datasets containing labeled historical transactions are used to train a classifier that builds a detection model capable of predicting whether a new transaction is fraudulent or not. These algorithms include support vector machines [10,11], hidden Markov models [11,12], logistic regression [10,13], decision trees [14,15], random forests [10,16,17,18,19], and k-nearest neighbors [20,21].
Unsupervised classification methods are used to detect unusual behavior of a system and to identify transactions that do not conform to the model as potentially fraudulent cases [22][23][24]. They can help to detect new fraud patterns that have not been seen before.
However, credit card fraud detection presents several challenges that attract the attention of the artificial intelligence community. One is that credit card fraud datasets are highly imbalanced, since the number of genuine transactions is much higher than the number of fraudulent ones. As a result, many traditional classifiers fail to detect minority class objects in these skewed datasets [25,26]. In addition, these traditional classifiers aim to identify transactions with a high probability of being fraudulent based only on individual transaction information such as amount, time and transaction location [27][28][29], and ignore the detailed sequential information that defines the consumer's profile. Such models are inadequate for credit card fraud detection, since they do not consider consumer spending behavior, which is useful for discovering relevant fraud patterns that evolve over time due to seasonality and new attack strategies [30,31].
Recently, deep learning methods based on recurrent neural networks (RNN), and especially their variant Long Short Term Memory networks (LSTM), have been used in the fraud detection field given their reputation as some of the most accurate learning algorithms for sequence analysis [32][33][34][35][36]. An RNN is a dynamic machine learning approach capable of analyzing the temporal behavior of bank accounts by modeling the sequential dependency between consecutive transactions of credit card holders.
The attention mechanism has also recently been proposed [37] as a way to find context-dependent representations. This method takes into account dependencies between items in a sequence regardless of their distance. It has been used with great success to define context in machine translation [37] and image captioning [38]. The idea behind the attention mechanism is to take a weighted average of a set of vectors to construct a context vector that contains the most relevant information, which is then used as input to the next layer.
In this paper, we use LSTM based sequence models and attention models to discover temporal correlations between events that may be far away from each other in the input sequence, which improves the effectiveness of the classification task and increases the detection of fraudulent transactions compared to traditional models.

Background
In this section, we will introduce the related literature that formed the basis of our work.

Dimensionality reduction algorithms
Feature selection and feature extraction are fundamental preprocessing steps in fraud detection systems [16,39]. They select the optimal subset of relevant features by removing redundant, noisy and irrelevant features from the original dataset, and they decrease the computational cost without negatively affecting classification accuracy.

Feature selection
The basis of credit card fraud detection lies in the analysis of the cardholder's spending behavior. This spending profile is analyzed using an optimal selection of variables that capture the unique behavior of a credit card and reveal very dissimilar transactions within the purchases of a customer. Also, since the profiles of both legitimate and fraudulent transactions tend to change constantly, an optimal selection of variables that strongly differentiates the two profiles is needed to classify credit card transactions efficiently [40][41][42]. Therefore, a swarm intelligence based feature selection approach [43] is used to explore our datasets and study the influence of each feature on the prediction of the target class.

Feature extraction
In this research, we employ three algorithms to reduce the dataset dimensionality, namely Principal Component Analysis (PCA) [44], t-distributed stochastic neighbor embedding (t-SNE) [45,46], and uniform manifold approximation and projection (UMAP) [47,49]. These algorithms are regarded as among the best dimension reduction algorithms for feature extraction in many applications, such as bioinformatics and visualization [47].

• Principal Component Analysis (PCA)
PCA is a popular method used in dimension reduction, aiming to transform the original set of n correlated variables to a new subset of m uncorrelated variables called principal components (PCs) that successively maximize variance. These new variables are linear combinations of the original variables and are derived in decreasing order of importance, such that the first principal component accounts for as much as possible of the variation in the original data.
Given a set of n correlated variables $f_1, f_2, \dots, f_n$, the objective of PCA is to replace these n measured variables with m derived variables $z_1, z_2, \dots, z_m$, which are uncorrelated and whose variances decrease from first to last, while minimizing information loss. This transformation satisfies the following properties. The first component $z_1$ is a linear combination of the original variables,

$$z_1 = \alpha_1 f_1 + \alpha_2 f_2 + \cdots + \alpha_n f_n, \tag{1}$$

with the maximum possible variance. In general, $z_k$ has the maximum possible variance among all possible linear functions of $f_1, f_2, \dots, f_n$, subject to $z_k$ being uncorrelated with $z_1, z_2, \dots, z_{k-1}$, for $2 \le k \le n$.
Although PCA captures the maximum variance among features, as a linear algorithm it may perform poorly on features with nonlinear relationships. Therefore, to represent high dimensional data in a low dimensional space while accounting for nonlinear structure, nonlinear dimensionality reduction algorithms such as t-SNE and UMAP are employed.
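
The properties of the principal components can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `pca` and the synthetic variables `f1`, `f2`, `f3` are hypothetical and are not taken from the datasets used in this paper. It derives the components from the eigenvectors of the covariance matrix and returns them in decreasing order of variance.

```python
import numpy as np

def pca(features: np.ndarray, m: int):
    """Project the rows of `features` (samples x n variables) onto m principal components."""
    centered = features - features.mean(axis=0)
    # Covariance matrix of the n original variables.
    cov = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:m]
    alphas = eigenvectors[:, order]  # each column holds the coefficients alpha of one component
    return centered @ alphas, eigenvalues[order]

# Synthetic, illustrative data: f2 is strongly correlated with f1.
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=500)
f3 = rng.normal(size=500)

z, variances = pca(np.column_stack([f1, f2, f3]), m=2)
```

In this sketch the two retained components are uncorrelated and the first one carries the largest variance, which matches the description of PCA given in this subsection.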

• t-distributed Stochastic Neighbor Embedding (t-SNE)
The t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm that is well suited for reducing high dimensional nonlinear data to a two or three dimensional space. It tries to place each point from the high dimensional space in the low dimensional one so as to preserve neighborhood identity: points that lie close together are highly similar.
There are two main stages in t-SNE. First, it constructs a probability distribution over pairs of data points such that a pair of similar points is given a high probability, while a pair of distant points is given a low probability. Second, it defines a similar probability distribution in the lower dimensional space and minimizes the Kullback-Leibler (KL) divergence between the two distributions [45].
Given a high dimensional input dataset $x_1, x_2, \dots, x_n$ in $\mathbb{R}^m$, the goal is to find an optimal low dimensional representation $y_1, y_2, \dots, y_n$ in $\mathbb{R}^k$ such that $k \ll m$. The similarity of data point $x_j$ to data point $x_i$ is represented by the conditional probability $p_{j|i}$. For the low dimensional counterparts $y_i$ and $y_j$ of the high dimensional data points $x_i$ and $x_j$, a similar conditional probability, denoted $q_{j|i}$, is computed. Once $p_{j|i}$ and $q_{j|i}$ are calculated, the goal of the t-SNE algorithm is to minimize the mismatch between the high and low dimensional representations. The cost function (Eq. 2), which sums the Kullback-Leibler (KL) divergences over all points, is given as:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \tag{2}$$

where P and Q represent the probability distributions for $p_{j|i}$ and $q_{j|i}$ respectively. Although the t-SNE algorithm is a good technique for visualizing data in a low dimensional space, it computes conditional probabilities for every pair of samples and involves hyperparameters that are not always simple to tune, which comes with a high computational cost.
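
To make the role of the KL cost concrete, the sketch below evaluates it for a single point and two candidate embeddings. The probability values and the names `p_high`, `q_good` and `q_bad` are hypothetical; they only illustrate that an embedding which keeps the neighbour ranking yields a lower cost than one which reverses it.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_j p_j * log(p_j / q_j) for two discrete distributions."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

# Hypothetical neighbour probabilities of one point x_i over three other points.
p_high = [0.7, 0.2, 0.1]  # similarities in the original high dimensional space
q_good = [0.6, 0.3, 0.1]  # embedding that keeps the same neighbour ranking
q_bad = [0.1, 0.2, 0.7]   # embedding that swaps the nearest and farthest points

cost_good = kl_divergence(p_high, q_good)
cost_bad = kl_divergence(p_high, q_bad)
```

Because the divergence is weighted by the high dimensional probabilities, placing a true near neighbour far away in the embedding is penalized heavily, which is why t-SNE mainly preserves local neighbourhoods.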

• Uniform Manifold Approximation and Projection (UMAP)
Uniform Manifold Approximation and Projection (UMAP) is an emerging dimensionality reduction technique recently published by McInnes and Healy [49]. It is grounded in Riemannian geometry and algebraic topology: it uses local manifold approximations and patches together their local fuzzy simplicial set representations to construct a topological representation of the high dimensional data. A similar process is then used to search for a low dimensional projection of the data whose fuzzy topological structure is as close as possible to that of the original space. Unlike t-SNE, which uses a probabilistic model, UMAP is a graph-based algorithm. The first phase of UMAP constructs a weighted k-neighbour graph over the original high dimensional data points such that the edge-wise cross-entropy between the weighted graph and the original data is minimized. Then, the k-dimensional eigenvectors of the UMAP graph are used to represent each of the original data points.
UMAP considers the input data $X = \{x_1, x_2, \dots, x_n\}$ in $\mathbb{R}^m$ with a metric (or dissimilarity measure) $d: X \times X \to \mathbb{R}_{+}$, and looks for an optimal low dimensional representation $y_1, y_2, \dots, y_n$ in $\mathbb{R}^k$ such that $k < m$. Given an input hyperparameter k, for each $x_i$ we compute the set $\{x_{i_1}, x_{i_2}, \dots, x_{i_k}\}$ of the k nearest neighbors of $x_i$ under the metric d. For each $x_i$, we define $\rho_i$ and $\sigma_i$. Let

$$\rho_i = \min \{\, d(x_i, x_{i_j}) \mid 1 \le j \le k,\ d(x_i, x_{i_j}) > 0 \,\},$$

and let $\sigma_i$ be defined such that

$$\sum_{j=1}^{k} \exp\!\left( \frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i} \right) = \log_2(k).$$

One chooses $\rho_i$ to ensure that at least one data point is connected to $x_i$ with an edge weight of 1, which is equivalent to the resulting fuzzy simplicial set being locally connected at $x_i$.
The value $\sigma_i$ acts as a length scale parameter. Together these define a weighted directed graph $\bar{G} = (V, E, \omega)$, where V is the set of vertices (in this case, the data X), E is the set of directed edges $E = \{(x_i, x_{i_j}) \mid 1 \le j \le k,\ 1 \le i \le n\}$, and $\omega$ is the weight function for edges defined by setting

$$\omega\big((x_i, x_{i_j})\big) = \exp\!\left( \frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i} \right).$$

UMAP then defines an undirected weighted graph G from the directed graph $\bar{G}$ via symmetrization. Let A be the adjacency matrix of the graph $\bar{G}$. A symmetric matrix can be obtained by

$$B = A + A^{T} - A \otimes A^{T},$$

where T denotes the transpose and $\otimes$ denotes the Hadamard (or pointwise) product. The undirected weighted graph G (the UMAP graph) is then defined by its adjacency matrix B. The goal is to find the optimal low dimensional coordinates $\{y_i\}_{i=1}^{n}$, $y_i \in \mathbb{R}^k$, that minimize the edge-wise cross entropy with the original data at each point. The evolution of the UMAP graph Laplacian can be regarded as a discrete approximation of the Laplace-Beltrami operator on a manifold defined by the data [48]. Implementation and further details of UMAP can be found in [49].
Compared to t-SNE, UMAP is better able to preserve as much of the local and more of the global data structure, with superior runtime performance [49].
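
The way UMAP turns neighbour distances into edge weights can be sketched in a few lines. The example below is an illustration only: it takes $\rho_i$ as the smallest positive neighbour distance and finds $\sigma_i$ by bisection so that the weights sum to $\log_2(k)$, as in the UMAP paper [49]. The function names `rho_sigma` and `edge_weights`, the search bounds and the toy distances are assumptions, and the code is not the reference implementation.

```python
import math

def rho_sigma(neighbor_distances, iterations=64):
    """Return (rho, sigma) for one point given distances to its k nearest neighbours."""
    k = len(neighbor_distances)
    # rho: distance to the closest neighbour that is not a duplicate of the point.
    rho = min(d for d in neighbor_distances if d > 0)
    target = math.log2(k)

    def total_weight(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_distances)

    # total_weight grows with sigma, so bisect on a logarithmic scale (bounds are assumed).
    low, high = 1e-6, 1e6
    for _ in range(iterations):
        mid = math.sqrt(low * high)
        if total_weight(mid) > target:
            high = mid
        else:
            low = mid
    return rho, math.sqrt(low * high)

def edge_weights(neighbor_distances, rho, sigma):
    """Weight of the directed edge from the point to each of its neighbours."""
    return [math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_distances]

# Toy distances from one point to its k = 4 nearest neighbours.
distances = [0.5, 1.0, 1.5, 3.0]
rho, sigma = rho_sigma(distances)
weights = edge_weights(distances, rho, sigma)
```

The nearest neighbour always receives weight 1, and more distant neighbours receive smaller weights on a scale that adapts to the local density around each point.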

Long Short Term Memory networks
Long Short-Term Memory (LSTM) is a special type of artificial recurrent neural network (RNN) architecture used to model time series information in the field of deep learning (Fig. 1). In contrast to standard feedforward neural networks, LSTM has feedback connections between hidden units that are associated with discrete time steps, which allow long term sequence dependencies to be learned and a transaction label to be predicted given the sequence of past transactions. LSTMs were developed to overcome the problem of vanishing and exploding gradient that can be observed during the training of traditional RNNs [50].
An LSTM unit consists of a memory cell that stores information and is updated by three special gates: the input gate, the forget gate and the output gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. Figure 2 depicts the LSTM unit structure.
At time t, $x_t$ is the input data of the LSTM cell, $h_{t-1}$ is the output of the LSTM cell at the previous moment, $c_t$ is the value of the memory cell, and $h_t$ is the output of the LSTM cell.
The LSTM unit calculation can be divided into the steps below: 1. The first step is to calculate the candidate memory cell value $\tilde{c}_t$ according to Eq. (3), where $W_c$ is the weight matrix and $b_c$ is the bias:

$$\tilde{c}_t = \tanh\big(W_c \cdot [h_{t-1}, x_t] + b_c\big). \tag{3}$$

Benefiting from the three control gates and the memory cell, LSTM can easily retain, read, reset and update information over long periods of time.
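
For readers who want to see the full gate computation, the sketch below implements one time step of a standard LSTM cell, assuming NumPy is available. It follows the commonly used formulation in which each gate applies its own weight matrix to the concatenation of $h_{t-1}$ and $x_t$. The parameter names (`W_i`, `W_f`, `W_o` and the biases), the layer sizes and the random initialization are assumptions made for illustration, not the configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; each W_* acts on the concatenation [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory cell value
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                         # updated memory cell
    h_t = o_t * np.tanh(c_t)                                   # output of the cell
    return h_t, c_t

# Illustrative sizes and random weights (assumed, not the paper's settings).
rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3
params = {}
for gate in "cifo":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_features))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Run the cell over a toy sequence of five transactions.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_features)):
    h, c = lstm_step(x_t, h, c, params)
```

The forget gate scales the previous memory, the input gate scales the candidate value, and the output gate decides how much of the updated memory is exposed as $h_t$.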

Attention mechanism
In modern deep learning research such as computer vision and language translation [51,52], attention mechanisms have become an effective way to achieve excellent results by selecting important information. This mechanism aims to focus only on the most relevant piece of information, rather than all of it, which is sufficient for further neural processing [53].
To illustrate the attention mechanism, let us consider an RNN Encoder-Decoder architecture in which an encoder reads the input sequence of vectors $x = (x_1, x_2, \dots, x_n)$ into a vector c. This approach is often expressed in an RNN structure in the following form:

$$S_t = f(x_t, S_{t-1})$$

and

$$c = q(S_1, \dots, S_n), \tag{8}$$

where $S_t$ is the hidden state at time t and c is the output vector of the RNN, which is generated from the hidden states.
In the attention model, the context vector $c_t$ is strongly related to the sequence of annotations $(h_1, \dots, h_n)$ to which an encoder maps the input sequence. Each annotation $h_j$ contains information about the whole input sequence with a strong focus on the parts surrounding the j-th word of the input sequence. Figure 3 shows the attention mechanism in a neural network. A weighted sum of the annotations $h_j$ forms the context vector $c_t$:

$$c_t = \sum_{j=1}^{n} \alpha_{tj} h_j,$$

where the weight $\alpha_{tj}$ of each annotation $h_j$ is given by

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})},$$

in which

$$e_{tj} = a(S_{t-1}, h_j). \tag{17}$$

The function $a(S_{t-1}, h_j)$ is an alignment model that scores how well the inputs around position j match the output at position t. The score is calculated from the RNN hidden state $S_{t-1}$ and the j-th annotation $h_j$ of the input sentence. The attention mechanism allows a neural network to focus on a subset of its inputs by assigning the largest weights to the most relevant ones. The attention mechanism in Fig. 3 aims to select the most important inputs from the input sequence $x_1, x_2, \dots, x_n$ by using the weights $\alpha_{tj}$.
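
The weighting step can be illustrated with a small, dependency-free sketch. The alignment model $a(S_{t-1}, h_j)$ is assumed here to be a plain dot product, chosen only to keep the example short; in practice it is often a small learned network. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores e_tj into weights alpha_tj that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(s_prev, annotations):
    """Return (alpha, c_t) for the state S_{t-1} and the annotations h_1..h_n."""
    # ASSUMPTION: dot-product alignment model e_tj = S_{t-1} . h_j
    scores = [sum(s * h for s, h in zip(s_prev, h_j)) for h_j in annotations]
    alpha = softmax(scores)
    dim = len(annotations[0])
    # Context vector: weighted sum of the annotations.
    context = [sum(a * h_j[d] for a, h_j in zip(alpha, annotations)) for d in range(dim)]
    return alpha, context

# Toy annotations for a sequence of three items and a toy previous state.
annotations = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
s_prev = [2.0, 0.0]
alpha, context = attention_context(s_prev, annotations)
```

In this example the first and third annotations align best with the state, so they receive most of the weight and dominate the context vector, while the second contributes little.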

Methods and materials
As mentioned above, our proposed model first applies data preprocessing techniques, namely feature selection and dimensionality reduction, to the credit card fraud datasets in order to reduce the number of input features before they are fed into the model. Then the sequence learner LSTM is employed as the base dynamic pattern recognition classifier to capture the sequential dependency between consecutive credit card transactions. Next, the attention mechanism is introduced to weight the information output by the hidden layers of the LSTM differently, which allows our model to discover relevant fraud patterns and better detect very dissimilar transactions within the purchases of a consumer. The architecture of the proposed system is shown in Fig. 4.

Fig. 3 Attention mechanism
The steps of our proposed model for credit card fraud detection are detailed below. We will describe the two datasets we use in our experiments, data preprocessing results and we will provide the detailed implementation and the evaluation metrics used in this work.

Datasets
Datasets provide a way to train and validate the efficacy of the proposed methods, and hence play an important role in motivating research. In this subsection, we describe the two different datasets used in the experiments with our proposed approach. A brief summary of the two datasets is presented in Table 1.

Dataset-1
The first dataset, downloaded from www.kaggle.com, consists of credit card transactions made by European cardholders over two days in September 2013, and contains 492 frauds out of 284,807 transactions [54]. It consists of 31 features: the time at which a transaction took place, the transaction amount, 28 other attributes labeled V1 to V28, and the target label 'Class', which indicates whether a transaction is fraudulent ('1') or not ('0').

Dataset-2
The second dataset consists of 594,643 transactions made during 180 simulated days, among which 7200 (≈ 1.2%) are considered fraudulent. This is a synthetic dataset created for financial fraud detection using the BankSim software, a simulation tool specifically designed to emulate fraud data [55]. BankSim uses a multi-agent-based simulation methodology built on a sample of aggregated real transaction data provided by a bank in Spain. The original bank data is made up of thousands of transactional data records from November 2012 to April 2013. BankSim uses multiple agents of three different categories to mimic this original bank data: traders, customers, and fraudsters. These agents communicate with each other over a sequence of simulated days, resulting in a purchase transaction log that closely resembles the original bank data. All attributes are presented in Table 2.

Dataset processing
We can see that our two datasets are highly imbalanced, since the negative (majority) instances far outnumber the positive (minority) class instances. For example, in Dataset-1, frauds account for only about 0.17% of all transactions (492 out of 284,807), as shown in Fig. 5. Thus, to enhance the classification performance, we apply the Synthetic Minority Oversampling Technique (SMOTE) [56,57] to generate synthetic training instances from the minority class. Figure 6 presents the schematic diagram of Dataset-1 after transformation with the SMOTE method.
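
The interpolation idea behind SMOTE can be sketched without any external dependency. In the example below, each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the value of `k`, the seed and the toy samples are assumptions made for illustration; library implementations such as the one in imbalanced-learn offer more options.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding the sample itself.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(others)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority (fraud) samples with two features each.
fraud = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.4]]
new_fraud = smote(fraud, n_synthetic=6)
```

Oversampling of this kind is commonly applied to the training split only, so that the validation and test sets keep the original class distribution.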

Feature selection
As mentioned above, we use feature selection as a first step in the exploration of our datasets. The objective is to study the influence of each feature on the prediction of the target class and to select the optimal subset of relevant features by removing redundant and noisy attributes. From the swarm intelligence algorithm plots for Dataset-1 in Fig. 7, we can see that the features labeled Time, V5, V6, V7, V8, V9, V13, V15, V16, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28 and Amount do not contribute to the fraud prediction. We therefore treat them as irrelevant attributes and remove them from the original dataset. Table 3 presents the remaining features.

Feature extraction
We applied the three dimensionality reduction algorithms PCA, t-SNE and UMAP to our credit card datasets to generate robust and discriminative features for fraudulent instances, which aid in detecting fraudulent transactions effectively. Figure 8 shows the performance of each algorithm applied to Dataset-1. In each case, the dataset was reduced to a three dimensional space using default parameters, and the plots were colored according to the label of each data point in Dataset-1: purple represents normal transactions and orange represents fraudulent transactions. It can be seen that PCA does not discriminate well between the two classes, whereas UMAP and t-SNE show very good discrimination. However, compared with t-SNE, UMAP better preserves both the local and the global data structure, with superior runtime performance. Based on this, we choose UMAP as the reduction algorithm to extract the embedding features used during the training and testing phases.

Implementation and evaluation metrics
Fig. 8 The performance of different dimensional reduction algorithms on credit card fraud dataset. Feature size is reduced to dimension 3 by a PCA, b t-SNE and c UMAP

We employ Long Short Term Memory networks to model the sequential dependency between consecutive transactions of credit card holders. The hidden state architecture of LSTM establishes connections between the network's nodes across time steps. The model can therefore retain information from past inputs, allowing it to identify temporal associations between events that may be far apart in the input sequence. LSTM is well suited to modeling succession patterns in sequential data, where the occurrence of one event may depend on the presence of several other events further back in time. However, several aspects can still be improved:
1. LSTM networks have to represent the entire input sequence $x_1, x_2, \dots, x_n$ as a single vector c, which can cause information loss since all information needs to be compressed into c. Furthermore, the passed information has to be decoded from this single vector only, which is a highly complex task. Thus, the performance of LSTM networks degrades rapidly as the length of the input sequence increases.
2. LSTM networks process all elements of the input sequence in the same manner; there is no way to give more importance to some input elements than to others while processing the sequence.
To address the problems above, we add the attention mechanism to the LSTM layers, which enables the classifier to automatically extract global dependencies from the sequence of transactions and focus on the data items that are most relevant to the classification task. Our deep learning model is composed of 6 layers, as depicted in Fig. 9: an attention layer is placed before the LSTM layers, and two stacked LSTM layers follow, each with dropout. The first LSTM layer takes the output of the attention layer as its input, with tanh as the activation function. After the two LSTM layers, we add a dense layer to obtain a two-valued output corresponding to the prediction classes (normal transaction and fraud transaction). Finally, a BatchNormalization layer is applied after the dense layer, and its output is passed into a softmax classification layer. For comparison purposes, the LSTM network without the attention layer is shown in Fig. 10.
This model is built with the Keras deep learning framework, an open-source neural network library written in Python. The detailed workflow of the proposed model is summarized in Algorithm 1.

Algorithm 1: Workflow of our proposed prediction model
Input: Historical credit card transactions collected up to time n: $x_1, x_2, \dots, x_n$
Output: Prediction of fraud based on sequential transactions of the credit card holder
1. Start
2. Divide dataset into training, validation, and testing sets.
3. Credit card data is taken and reshaped into a three dimensional tensor (N, L, F), where N is the number of training sequences, L is the sequence length and F is the number of features of each sequence.
4. A network structure is built with 6 layers, where there is one input layer and one attention layer followed by two LSTM networks in the next layers, a dense layer in the subsequent layer to obtain two valued outputs which are the prediction classes (normal and fraud), and finally a BatchNormalization layer is applied after the dense layer.
5. Define learning parameters (memory size, learning rate, batch size and epochs) and set tensor variables for weight and bias vectors.
6. Define cross-entropy loss function and add Adam optimization function to minimize the cross-entropy loss function.
7. Train the constructed network on the credit card data.
8. Use the output of the last layer as prediction of the next time step.
9. while optimal convergence is not reached do
10.    Compute training error.
11.    Compute validation error.
12.    Update weights and biases using back propagation.
13. Obtain predictions by providing test data as input to the trained network.
14. Evaluate accuracy by comparing predictions made with actual data.
15. End

In the credit card fraud domain, fraud detection systems try to reduce the false positive and false negative rates, knowing that the latter (FN) has severe costs for financial institutions as well as causing a decrease in customer satisfaction.
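
The reshaping of transactions into an (N, L, F) tensor can be sketched as a sliding window over a time-ordered list of transactions. The example below is an illustration under stated assumptions: the function name `make_sequences` is hypothetical, each window is labeled with the class of its last transaction, and the window slides over a single ordered list rather than being built per cardholder. The paper does not specify these details.

```python
def make_sequences(transactions, labels, seq_len):
    """Slide a window of seq_len transactions; the target is the label of the last one."""
    sequences, targets = [], []
    for end in range(seq_len, len(transactions) + 1):
        sequences.append(transactions[end - seq_len:end])
        targets.append(labels[end - 1])
    return sequences, targets

# Six toy transactions with F = 2 features each, in temporal order.
transactions = [[float(i), float(i) * 0.5] for i in range(6)]
labels = [0, 0, 0, 1, 0, 0]

# N = 4 sequences of length L = 3 with F = 2 features.
X, y = make_sequences(transactions, labels, seq_len=3)
```

Each element of `X` is one input sequence for the recurrent network, so the nested list has the (N, L, F) layout that the workflow describes and can be converted to an array before training.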
To assess the performance of our proposed fraud detection system more precisely, we use the confusion matrix shown in Table 4. From this matrix, the following evaluation metrics are derived, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$

Table 4 Classification confusion matrix

However, when evaluating fraud detection models, financial institutions have to face several challenges, such as the false positive rate and the false negative rate. False positives (FP) are cases classified by the Fraud Detection System (FDS) as fraudulent transactions that are in reality normal behavior. Although they are classification errors, these cases do not cause significant damage to financial institutions. False negatives (FN), on the other hand, are cases wrongly identified by the FDS as normal transactions that are truly fraudulent, which has severe costs for financial institutions as well as causing a decrease in customer satisfaction. Therefore, we are most interested in the sensitivity (recall) metric, which gives the accuracy of classification on positive (fraud) cases and is the most appropriate evaluation metric in this domain for checking the effectiveness of our proposed model.
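
The following sketch shows why recall is more informative than accuracy on imbalanced data. It computes accuracy, precision and recall from the confusion matrix counts using their standard definitions. The function names and the toy labels (2 frauds among 100 transactions) are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with fraud encoded as 1 and normal as 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy test set: 2 frauds among 100 transactions.
y_true = [1, 1] + [0] * 98
always_normal = [0] * 100          # a classifier that never flags fraud
one_caught = [1, 0, 1] + [0] * 97  # catches one fraud and raises one false alarm

baseline = evaluate(y_true, always_normal)
model = evaluate(y_true, one_caught)
```

Both classifiers reach the same accuracy, yet the first one misses every fraud. Only recall separates them, which is why it is emphasized in this domain.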

Results and discussion
Our study is based on the processed credit card fraud datasets described above, each of which is a temporally ordered sequence of transactions, which allows our proposed model to predict the label of a transaction after having seen several transactions that precede it. Each dataset is divided into three sets. The first 70% of the data is the training set used for training the models, the next 15% is the validation set used for validating the classifiers to avoid overfitting and improve model performance, and the last 15% is the test set used to assess how well the network generalizes. The same training and testing sets are used for the comparison between our proposed model and the baseline LSTM model.
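
A split of this kind can be sketched as follows. The example assumes that the records are already sorted by time and that the split is made by position, so that later transactions never leak into the training set; the function name `chronological_split` is hypothetical.

```python
def chronological_split(records, train_frac=0.70, val_frac=0.15):
    """Split time-ordered records into train, validation and test parts by position."""
    n = len(records)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return records[:train_end], records[train_end:val_end], records[val_end:]

# Toy example: 100 time-ordered record identifiers.
records = list(range(100))
train, val, test = chronological_split(records)
```

Keeping the original order means that the validation and test sets contain only transactions that occur after those seen during training.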
The accuracy and recall plots for both models applied to Dataset-1 are presented in Fig. 11, from which we see that our model (LSTM-attention) achieves higher accuracy and sensitivity (recall) rates. This significant improvement arises because the attention mechanism extracts more relevant patterns from the transaction sequences: it allows the sequence classifier to automatically focus on the data items that are most important to the classification task through a data-driven weighted average of the transactions contained in the sequence, which results in improved detection performance. Moreover, to highlight the classification performance of our proposed model in terms of sensitivity, we present the confusion matrix of each model applied to Dataset-1 (Fig. 12). It shows that our proposed model is able to minimize the number of fraudulent transactions classified as normal and to catch the rare fraudulent transactions, which is of great importance in real life for financial service providers.
In addition, to assess our experimental results, we compared our work with the state-of-the-art fraud detection models listed in Table 5. These models were selected mainly because they exhibit promising performance and use the same dataset, Dataset-1, described in this work, which makes the comparison more practical and reliable. Table 5 shows the performance values of each model in terms of accuracy, precision and sensitivity (recall). The latter metric is of high importance in the fraud detection domain, since financial institutions are most interested in detecting the fraud instances that may occur, in order to protect consumers' interests and reduce the heavy annual financial losses caused by fraud. As we can see from these experimental results, our proposed model achieves better results than the GRU, LSTM, SVM, KNN and ANN classification methods with which it is compared, which demonstrates the effectiveness of our proposed model on the credit card fraud detection task.

Conclusion
In this paper, we aimed to improve prediction efficiency in the identification of fraudulent transactions by combining the strengths of different machine learning techniques, namely: the swarm intelligence based approach to select the optimal subset of relevant features, the Uniform Manifold Approximation and Projection (UMAP) method to reduce the dataset dimensionality, the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced data, the sequence learner LSTM networks to model long term dependencies within transaction sequences, and the attention mechanism to automatically focus on the data items that are most relevant to the classification task. Thus, our proposed model is capable of capturing useful patterns in consumer behavior, which helps to distinguish fraudulent transactions effectively from normal ones.
To validate our results, we evaluated our model on two different credit card datasets, and it showed its ability to deliver high sensitivity in the detection of fraudulent instances, which are of great interest in this domain. Furthermore, in comparison with recent works, our model provides very good performance.
As future work, we plan to study a novel credit card fraud detection model that relies solely on attention and the transformer architecture, without using any recurrent networks to process sequences.