IPerFEX-2023: Indonesian personal financial entity extraction using indoBERT-BiGRU-CRF model
Journal of Big Data volume 11, Article number: 139 (2024)
Abstract
There is minimal research focusing on applications of Indonesian named entity recognition (NER) in a specific domain. This study proposes an Indonesian personal financial entity extraction task that can be utilized in a financial assistant chatbot system to interpret a user's financial situation for personalization. Because of its simplicity, a chatbot can promote financial management practices to youth as early as possible in their career. However, the challenge in financial NER is numerical entity extraction, which relies heavily on contextual information and suffers from the out-of-vocabulary (OOV) problem. Therefore, to extract 15 personal financial entities in daily Indonesian discussions (expense-type, expense-amount, income-type, income-amount, asset-type, asset-amount, saving-type, saving-amount, liability-type, liability-amount, family, time, financial-goal, age, and occupation), this research proposes a dataset, IPerFEX-2023, trained using a Bidirectional Gated Recurrent Unit (BiGRU) and Conditional Random Field (CRF) with the Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) pre-trained model for feature embeddings (IndoBERT-BiGRU-CRF). It is compared with the corresponding Bidirectional Long Short-Term Memory (BiLSTM) model (IndoBERT-BiLSTM-CRF) as a baseline. Not only does the IndoBERT-BiGRU-CRF model achieve the best performance with a 0.73 F1-score, but it is also 14% faster on average than the corresponding baseline model due to its simpler unit structure. This paper also discusses future directions, covering a model enhancement strategy based on the error analysis results and the complementary tasks needed to complete personalization.
Introduction
Named entity recognition (NER) is a subfield of information extraction that aims to identify words or phrases in unstructured text data that are relevant to a set of predefined categories [1,2,3,4]. The majority of Indonesian NER works extract entities such as person, location, and organization [5, 6]. This study implements NER in the personal finance domain to extract key personal financial entity information from daily Indonesian discussions. The extracted information can be utilized by a financial assistant chatbot system to achieve personalization by capturing the personal financial information that is relevant to the financial condition of each user. This way, the chatbot system can provide tailored and personalized financial guidance towards the user's financial goals based on their personal financial condition. To the best of the authors' knowledge, the proposed task is the first of its kind in any language.

This study proposes a self-collected dataset of daily Indonesian discussions that were manually annotated with 15 personal financial entities: expense-type, expense-amount, income-type, income-amount, asset-type, asset-amount, saving-type, saving-amount, liability-type, liability-amount, family, time, financial-goal, age, and occupation. Unlike other NER problems, the proposed task involves numerical extraction. The main challenge is that numerical values are not self-contained; they rely heavily on contextual information from neighboring entities. In addition, extracting numerical values leads to the out-of-vocabulary (OOV) problem, since it is impossible to encode all numerical combinations.

Relevant literature utilizes bidirectional encoder representations from transformers (BERT) to transform the input text into low-dimensional contextual word embeddings. Recent literature [2] stated that a feature-based approach using an IndoBERT-BiLSTM-CRF architecture outperforms the fine-tuning approach; therefore, this study uses this architecture as a baseline. The IndoBERT language model proposed by [7] has a crucial role in this study. On one hand, it infuses contextual semantic information into the word embeddings. In addition, its WordPiece tokenization mechanism breaks down infrequent words into combinations of frequent sub-words, which is suitable for minimizing the OOV problem in the numerical value extraction present in this study. Other recent literature [1] experimented with a gated recurrent unit (GRU), which has a simpler unit structure than the LSTM, and claimed that the BERT-BiGRU-CRF model not only outperforms the corresponding LSTM model but also has a shorter training time. Due to the advantages that the GRU offers, this study proposes the IndoBERT-BiGRU-CRF model using the feature-based approach.

A single LSTM or GRU layer can only extract information from the past (left context) without knowing about the future (right context). However, in most NLP problems, such as sequence classification, it is better to have information and context about both the past and the future. An effective solution is to adopt a bidirectional LSTM or bidirectional GRU, which maintains two hidden states: one accesses the previous context, while the other accesses the future context (upcoming words).
The contributions of this study are as follows:
- Offer a self-collected Indonesian NER dataset in the personal finance domain.
- Train and evaluate the proposed dataset using state-of-the-art (SOTA) models.
- Perform quantitative analysis describing the role of each layer in the architecture by using the following additional architectures: BiGRU, BiLSTM, BiGRU-CRF, and BiLSTM-CRF.
- Perform error analysis to prioritize the next steps for model enhancement.
- Identify complementary tasks needed to complete personalization.
- Integrate research outcomes into a system for demonstration purposes.
Recent work
Financial technology
One state-of-the-art application of financial technology is the chatbot, or personal assistant. A financial assistant chatbot system provides an easy-to-access, simple, fast, secure, affordable, and personalized financial service that guides users in maintaining financial independence, from managing day-to-day finances and making investment decisions to preparing financial strategies that achieve the users' financial goals. Due to its ease of use, a chatbot can promote financial inclusion to underserved individuals and encourage youth to start financial management as early as possible in their career [8]. By starting financial management early, more time is available for assets to compound [9]. Moreover, technology advancements allow the advent of personalized financial services [8]. This study marks the first step by using deep learning models to extract key personal financial information from daily Indonesian discussions.
To build the system described above, several techniques can be applied to extract information from a customer's queries or texts, one of which is entity extraction. Entity extraction is a complex problem due to the frequent reuse of words, which requires additional context from the input sequence to avoid ambiguity [10]. The main challenge in this study is identifying identical words, phrases, or numerical entities of different types. For instance, “dapat gaji” (receive a salary) is an income-type entity, whereas “bayar gaji” (pay a salary) is an expense-type entity. This problem is amplified in this study by the presence of numerical entities. The phrase “5 juta” in the following sentences is only distinguishable by attending to the preceding entities.
1. “[INCOME-TYPE Gaji] saya [INCOME-AMOUNT 5 juta].”
2. “[EXPENSE-TYPE Biaya makan] saya [EXPENSE-AMOUNT 5 juta].”
This study uses the IOB annotation format, which follows this guideline: (1) I: inside a phrase; (2) O: outside a phrase; (3) B: beginning of a phrase [11]. This annotation format is especially useful for separating neighboring entities of the same type, as shown in Table 1.
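As a brief illustration of the scheme, the income example above can be tagged token by token as in the following minimal sketch; the token/tag pairs are illustrative and not taken from the dataset.

```python
# Illustrative IOB tagging of the income example discussed above.
# The token/tag pairs are hypothetical and only demonstrate the scheme.
sentence = ["Gaji", "saya", "5", "juta"]
tags = ["B-income-type", "O", "B-income-amount", "I-income-amount"]

for token, tag in zip(sentence, tags):
    print(f"{token}\t{tag}")
```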
Learning architectures
Current research on NER uses deep learning with a bidirectional recurrent neural network (RNN) and a CRF layer. Recent literature utilizes the BERT language model to provide contextual features as input. This approach is used in multiple languages, including Chinese [1, 12,13,14,15], Indonesian [2, 16,17,18], Portuguese [3], Persian [19], and English [20, 21]. The latest Indonesian NER work [2] confirmed the advantages of the feature-based approach by training a BiLSTM-CRF model with different feature embeddings and comparing it with the fine-tuning approach. The feature-based approach with the IndoBERT language model proposed by [7] achieved the best performance when compared with other language models such as FastText, mBERT, and XLM-R, with an average F1-Score of 94.90 ± 0.31.
A recent NER work [1] experimented with replacing the LSTM with a GRU and claimed that the GRU-based model not only has a shorter training time due to its simpler unit structure but also outperforms the corresponding LSTM model. Based on this finding, this study proposes an IndoBERT-BiGRU-CRF model with an IndoBERT-BiLSTM-CRF model as the baseline. Figure 1 illustrates the proposed model architecture. The IndoBERT layer creates contextual word embeddings in the form of dense vectors that are passed on to the BiLSTM/BiGRU layer for temporal feature extraction between embeddings. In the final layer, a CRF layer is used to resolve dependencies between tags.
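A minimal sketch of this layering, under the feature-based approach (IndoBERT kept frozen), is shown below. The checkpoint name and the use of the third-party pytorch-crf package are assumptions for illustration; the paper does not state which CRF implementation was used.

```python
# Sketch of an IndoBERT-BiGRU-CRF stack: frozen BERT embeddings -> BiGRU ->
# per-token tag scores -> CRF decoding. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (assumed implementation)


class IndoBertBiGruCrf(nn.Module):
    def __init__(self, num_tags, hidden_size=256,
                 checkpoint="indobenchmark/indobert-base-p1"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        for p in self.bert.parameters():      # feature-based: keep IndoBERT frozen
            p.requires_grad = False
        self.bigru = nn.GRU(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bigru(embeddings)          # temporal features between embeddings
        emissions = self.classifier(features)         # per-token tag scores
        mask = attention_mask.bool()
        if labels is not None:                        # negative log-likelihood loss
            return -self.crf(emissions, labels, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # best tag sequence per sentence
```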
Bidirectional encoder representations from transformers (BERT)
BERT is the encoder component of the transformer architecture; it transforms each token from the input sequence into a dense vector with contextual semantic information infused [1]. A transformer model uses a multi-head self-attention mechanism computed with a scaled dot-product as shown in Eq. 1, where Q represents the information of the current word in the encoder or decoder layer, K is the vector input, and V represents the aggregated word representation. Moreover, \(d_k\) is the dimension of a single query, key, or value input, used to scale the dot-product output before passing it into the softmax function [22]. The inputs are composed of query, key, and value vectors that are mapped from the input sequence by distinct linear layers whose parameters are learned during training. The separate linear layers lead to different input representations. The multi-head component of this mechanism allows a transformer model to attend to different versions of the query, key, and value inputs. The outputs of all the attention heads are concatenated as shown in Eq. 2, where \(head_i\) represents the output of each attention head. The concatenated result is then linearly projected to restore the original vector dimension.
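For reference, the standard scaled dot-product attention and multi-head concatenation that Eqs. 1 and 2 refer to can be written as follows (reconstructed from the description above and [22]; the notation in the published equations may differ slightly):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h)\,W^{O}
\]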
Before the advent of the transformer model, attention was implemented using an encoder-decoder RNN. However, training time suffers due to the sequential computation. In a transformer architecture, computations are done in parallel to speed up training. Unfortunately, parallel processing loses the ordering of words, which is crucial in processing language. To retain the order of the input sequence, a transformer model uses the positional encoding mechanism defined in Eqs. 3 and 4, where pos is the position of the token in the input sequence, i is the current index of the word embedding vector, and \(d_{model}\) is the dimension of the word embedding vector.
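The sinusoidal positional encoding that Eqs. 3 and 4 describe takes the standard form below (reconstructed from the definitions of pos, i, and \(d_{model}\) given above):

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]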
Aside from the position and word embeddings, the BERT input embedding is also composed of segment embedding to indicate the different sentences in the input sequence. The segment embedding is crucial in tasks that require more than a single sentence in the input, such as question answering. BERT is jointly trained using the masked language model (MLM) and next sentence prediction (NSP) [23]. The MLM technique replaces random tokens in the input sequence with a mask token “[MASK]”. The model then predicts the masked tokens based on the context of neighboring words. The NSP technique is used to predict text-pair representations. Another advantage that BERT offers is the WordPiece tokenization. This tokenization mechanism splits infrequent words into a combination of more frequent sub-words. The succeeding sub-words are annotated with a “##” prefix to allow recovery of the original word [24].
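A minimal sketch of the WordPiece behavior described above, using the HuggingFace tokenizer, is given below; the checkpoint name is an assumption for illustration, and the resulting sub-word splits depend on the vocabulary of the model actually used.

```python
# Sketch of WordPiece sub-word splitting; the checkpoint name
# "indobenchmark/indobert-base-p1" is an assumed example, not necessarily
# the exact model used in this study.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

# Infrequent tokens such as currency figures are split into frequent
# sub-words; continuation pieces carry the "##" prefix.
print(tokenizer.tokenize("Gaji saya Rp8.750.000 per bulan"))
```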
Responding to the lack of NLP resources for Indonesian, a pre-trained Indonesian monolingual BERT model, called IndoBERT, was proposed by [7]. The language model is trained on a 23.43 GB Indonesian dataset (Indo4B) collected from multiple public sources and existing Indonesian datasets. This study uses the IndoBERT language model to infuse contextual semantic information, in a bidirectional manner, into the word embeddings. Since this study is in the finance domain, context has a crucial role in recognizing financial entities. For instance, several asset-related entities are similar to saving-related entities, as shown in the following example. The two asset-type and saving-type entities have different contextual meanings: the second sentence does not mean that the user is capable of investing Rp 500 million each month in stock ABCD.
1. “...setiap bulan invest di [SAVING-TYPE saham XXX] [SAVING-AMOUNT 50 jt].”
2. “...punya [ASSET-TYPE saham ABCD] [ASSET-AMOUNT 500 juta].”
Long short-term memory (LSTM)
The drawback of using a traditional RNN is the vanishing gradient problem that occurs during backpropagation [25]. Backpropagation is the technique deep learning models use to learn from training data. If an input x is passed through the neural network, an output \(\hat{y}\) is computed based on the network's parameters (weights and biases). During backpropagation, a chain of differentiations is computed to determine the direction in which the parameters must be adjusted to minimize the prediction loss. Since backpropagation consists of multiplications of gradients in the range of 0–1, the product eventually reaches a very small number, which causes the model to stop learning.

The LSTM was introduced as a solution to the vanishing gradient problem. The LSTM computes the new hidden state, \(h_t\), based on the new cell state, \(c_t\), as shown in Eq. 10, where * represents element-wise multiplication. The new cell state, \(c_t\), is computed from the cell state of the previous timestep, \(c_{t-1}\), and the candidate cell state, \(\tilde{c}_t\), using Eq. 9. The candidate cell state, \(\tilde{c}_t\), is calculated from the current input value, \(x_t\), and the hidden state from the previous timestep, \(h_{t-1}\), using Eq. 8. The unit structure of an LSTM is composed of three gates: the input, forget, and output gates. These gates are computed using a sigmoid function that normalizes the result to the range 0–1. The input gate, \(i_t\), computed using Eq. 5, determines the portion of the candidate state, \(\tilde{c}_t\), in the new cell state, \(c_t\); the closer its value is to 1, the more the candidate state updates the cell state. The forget gate, \(f_t\), defined in Eq. 6, determines how much of the cell state from the previous timestep, \(c_{t-1}\), is retained in the new cell state, \(c_t\); the closer its value is to 1, the more of the previous cell state is retained, thus addressing the vanishing gradient problem by retaining memory from previous timesteps. The output gate, \(o_t\), is computed using Eq. 7, and its result is used to calculate the new hidden state as shown in Eq. 10.
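Collecting the gate and state definitions above, Eqs. 5–10 correspond to the standard LSTM formulation, reconstructed here for reference (\(\sigma\) is the sigmoid function and * is element-wise multiplication):

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
\]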
Gated recurrent unit (GRU)
Recent literature [26] proposed a simplified version of the LSTM called the gated recurrent unit (GRU). The GRU has a unit structure similar to the LSTM, but with only two gates instead of three. Reference [27] claimed that the GRU can achieve better generalization than the LSTM and trains faster due to its simpler unit structure. At every timestep, a candidate cell state, \(\tilde{h}_t\), is computed to update the cell state, \(h_t\), using a tanh function based on the current input value, \(x_t\), the output of the reset gate, \(r_t\), and the cell state from the previous timestep, \(h_{t-1}\), as shown in Eq. 13. The reset gate, \(r_t\), decides how much of the cell state from the previous timestep, \(h_{t-1}\), is relevant in computing the new candidate cell state, \(\tilde{h}_t\). The update gate, \(z_t\), determines how much of the new cell state, \(h_t\), is made up of the candidate state, \(\tilde{h}_t\), and the cell state from the previous timestep, \(h_{t-1}\), as shown in Eq. 14. One difference between the LSTM and the GRU is that the GRU uses only one update gate, \(z_t\), to control the effect of the previous cell state, \(h_{t-1}\), and the candidate state, \(\tilde{h}_t\), on the new cell state, \(h_t\), as shown in Eq. 14, whereas the LSTM uses two separate gates to control the new cell state, as shown in Eq. 9. The reset gate, \(r_t\), and the update gate, \(z_t\), are computed using the sigmoid function as shown in Eqs. 11 and 12, respectively. The update gate value, \(z_t\), is computed from the input value, \(x_t\), and the previous cell state, \(h_{t-1}\). The closer the value of the update gate is to 1, the more the candidate cell state, \(\tilde{h}_t\), contributes to the new cell state relative to the previous cell state, \(h_{t-1}\), as shown in Eq. 14. Since the update gate can be set close to 0, the new cell state, \(h_t\), can be nearly equal to the previous cell state, \(h_{t-1}\). This shows that the mechanism can retain values in the cell state even after multiple timesteps, thus mitigating the vanishing gradient problem and giving the unit a longer memory.
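In the same notation, Eqs. 11–14 correspond to the standard GRU formulation, reconstructed here for reference: the reset gate, the update gate, the candidate state, and the new state.

\[
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t * h_{t-1}) + b_h\right) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
\]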
Conditional random field (CRF)
The IOB format contains dependencies between adjacent tags, where certain tags cannot follow other tags. For instance, a B-expense-type tag cannot be followed by an I-income-type tag. In this study, a CRF layer decodes the output of the bidirectional RNN by taking neighboring tags into consideration [28]. Given sequential data, \(x=(x_1,\dots ,x_n)\), with labels, \(y=(y_1,\dots ,y_n)\), the CRF computes the likelihood of y given x using Eq. 15, where Z(x) is the normalizing factor, \(s(y_i,x,i)\) is the label score of \(y_i\) at position i, and \(t(y_{i-1},y_i,x,i)\) is the transition score from \(y_{i-1}\) to \(y_i\). The negative log of P(y|x) gives the log-likelihood loss used for training [29].
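Eq. 15 can be written out as follows (reconstructed from the score and transition terms defined above):

\[
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\sum_{i=1}^{n} s(y_i, x, i) + \sum_{i=2}^{n} t(y_{i-1}, y_i, x, i)\right)
\]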
Optimizer and evaluation
Deep neural networks are prone to overfitting, which can lead to poor model generalization. This can be mitigated using several approaches, such as regularization, early stopping, and the selection of the optimizer algorithm. This study uses the Adam and AdamW optimizer algorithms to update the model's parameters using an adaptive learning rate. The Adam algorithm updates the model's parameters using Eq. 16, where \(\theta_t\) is the updated parameter, \(\theta_{t-1}\) is the parameter from the previous timestep, \(\alpha\) is the learning rate, and \(\epsilon\) (epsilon) is a small number used to prevent division by 0. The values \(\hat{m}_t\) and \(\hat{v}_t\) are obtained from a series of computations described in Eqs. 17–21, where \(\beta_1\) and \(\beta_2\) represent the decay rates for the first and second moment estimates, respectively [30]. The gradient, g, of the function, f, is computed using Eq. 17. The first and second moment estimates are computed using Eqs. 18 and 20, respectively. The raw moment estimates, \(m_t\) and \(v_t\), are biased towards 0 during the initial epochs; therefore, bias-corrected moment estimates, \(\hat{m}_t\) and \(\hat{v}_t\), are used for the final parameter update. The bias correction for each moment estimate is defined in Eqs. 19 and 21.
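For reference, the standard Adam update chain that Eqs. 16–21 describe is reconstructed below (gradient, biased moment estimates, bias corrections, and parameter update):

\[
\begin{aligned}
g_t &= \nabla_{\theta} f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
\hat{m}_t &= m_t / (1 - \beta_1^{t}) \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
\hat{v}_t &= v_t / (1 - \beta_2^{t}) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
\]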
However, the stochastic gradient descent (SGD) algorithm has been observed to generalize better than the Adam algorithm [31]. Reference [31] experimented with an improved version of Adam, called AdamW, which reallocates the weight decay regularization into the parameter update so that it is proportional to the parameter, as shown in Eq. 22, where \(\lambda\) is a tunable weight decay hyperparameter. The AdamW algorithm achieves results competitive with SGD. Evaluation metrics widely used in NER are precision, recall, and F1-Score [1]. Precision measures the percentage of predicted positives that are correct. Recall measures the percentage of actual positives that are correctly predicted. The F1-Score takes the harmonic mean of precision and recall, thus taking into account both false positives and false negatives. TP (true positive) is the number of correctly predicted positives, FP (false positive) is the number of actual negatives predicted as positive, and FN (false negative) is the number of actual positives predicted as negative.
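The decoupled weight-decay update of Eq. 22 and the evaluation metrics take the following standard forms (reconstructed; \(\lambda\) is the weight-decay hyperparameter):

\[
\theta_t = \theta_{t-1} - \alpha\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1}\right)
\]
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]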
Research methodology
Problem identification
Based on a quick requirement-gathering survey conducted by the authors, respondents expect a financial assistant chatbot to provide personalized services such as advising changes in financial plans or investments, assisting users in working towards their financial goals, and suggesting investments based on their needs and priorities. Without personalization, financial advice can be optimal for one individual but detrimental for another. Therefore, this paper attempts to build a personal financial entity extraction model trained on daily Indonesian discussions to help the chatbot system understand the financial situation of each user. The proposed task can also be considered as applying NER in the personal finance domain. Given the following sample text, the goal of this study is to identify the following key financial information: (1) Monthly savings: Rp8.750.000; (2) Financial goal: nikah (getting married); (3) Timespan: 2028:
Saya bisa investasi per bulan di angka Rp8.750.000. Target saya sekarang nikah di tahun 2028
We hypothesize that SOTA NER models can be used to extract financial information that include numerical values as shown in the previous illustration.
Data collection
Existing Indonesian NER resources support entities such as person, location, and organization [5, 6, 32]. This study presents a dataset scraped from Quora (https://www.quora.com/), a social question-and-answer platform, to capture daily Indonesian messaging behavior. The data was collected by scraping Quora with a Python script using Selenium and ChromeDriver to handle the dynamic rendering of webpages. The annotation, carried out over a period of two weeks by the first author, a native Indonesian speaker, involved identifying and tagging named entities in the text, including expense type, expense amount, income type, income amount, and other personal financial terms.
The dataset comes in two versions, as shown in Table 2. To facilitate future research and comparison of results, the dataset has been randomly split into training, validation, and test sets with a ratio of 70%:15%:15%. The smaller version was used to validate the feasibility of the proposed task. After obtaining promising results from the first trial, the second version was created by annotating additional data to reach a size similar to benchmark datasets [5, 6, 32].
Data cleansing and annotation
The proposed dataset was preprocessed using the following procedures: (1) ensuring a single space after punctuation; (2) removing spaces between full stops in currency values; (3) removing emoticons, symbols, and special characters; (4) removing trailing spaces; (5) replacing newline characters with a single space; (6) replacing multiple full stops with a single full stop.
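A minimal sketch of how these cleansing steps could be implemented is shown below; the exact rules and regular expressions used by the authors are not published, so the patterns here are illustrative assumptions.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative implementation of the six cleansing steps listed above."""
    text = text.replace("\n", " ")                   # (5) newlines -> single space
    text = re.sub(r"\.{2,}", ".", text)              # (6) multiple full stops -> one
    text = re.sub(r"(?<=\d)\.\s+(?=\d)", ".", text)  # (2) "Rp8. 750.000" -> "Rp8.750.000"
    text = re.sub(r"([,;:!?])(?=\S)", r"\1 ", text)  # (1) ensure a space after punctuation
    text = re.sub(r"[^\w\s.,:;!?%\-/]", "", text)    # (3) drop emoticons/symbols
    text = re.sub(r"\s{2,}", " ", text)              # collapse repeated spaces
    return text.strip()                              # (4) trailing spaces

print(clean_text("Saya bisa investasi Rp8. 750.000 per bulan,target nikah 2028 😊\n"))
# -> "Saya bisa investasi Rp8.750.000 per bulan, target nikah 2028"
```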
A Python script was developed to accelerate the annotation process and enforce a structured approach, carried out in the following stages: (1) skipping duplicate sentences; (2) displaying any differences between the original and cleansed text; (3) prompting an option to exclude the current text; (4) annotating named entities with personal financial labels using the IOB format; (5) displaying the final JSON object before appending it to the data source. Figure 2 shows a snapshot of the annotation process (see Table 3).
Data preprocessing and hyperparameter setting
The annotated datasets were exported in a CSV format, with each row containing a word with the corresponding named entity type as shown in a sample portion in Table 4. The CSV format was chosen for its ease of use, as it can be easily loaded into popular machine learning frameworks and libraries. To ensure the quality and consistency of the datasets, the following validation procedures were performed: (1) The absence of duplicates within and between datasets; (2) The absence of false tag relationship annotation errors that violate the IOB format. The value count of each of the 15 entities and 1 OOV is presented in Table 5.
After the validation procedures, the dataset was preprocessed to prepare the input for training. The first step in the preprocessing involved mapping words and labels into integer values. The mapping was created by assigning a unique integer value to each word and label in the dataset, and two special tokens were added to the vocabulary: a padding token (“[ENDPAD]”) and an OOV token (“[OOV]”). This mapping procedure allowed the input data to be fed into the first embedding layer in models without feature representation. For models with feature representation, only a label encoder was needed to encode the labels since text was tokenized and vectorized by the IndoBERT language model.
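A sketch of this word/label-to-integer mapping is given below; the token names (“[ENDPAD]”, “[OOV]”) follow the text, but the implementation details are assumptions.

```python
# Illustrative vocabulary construction and encoding for models without
# feature representation; padding and OOV handling follow the text above.
def build_vocab(sentences, labels):
    word2idx = {"[ENDPAD]": 0, "[OOV]": 1}
    label2idx = {}
    for sent in sentences:
        for word in sent:
            word2idx.setdefault(word, len(word2idx))
    for seq in labels:
        for tag in seq:
            label2idx.setdefault(tag, len(label2idx))
    return word2idx, label2idx

def encode(sentence, word2idx, max_len):
    ids = [word2idx.get(w, word2idx["[OOV]"]) for w in sentence]
    ids += [word2idx["[ENDPAD]"]] * (max_len - len(ids))  # pad to fixed length
    return ids[:max_len]
```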
The models used in this study are divided into two main categories: (1) models with feature representation (IndoBERT-BiGRU-CRF and IndoBERT-BiLSTM-CRF); (2) models without feature representation (BiGRU, BiLSTM, BiGRU-CRF, and BiLSTM-CRF). Models without feature representation were built using Keras with the TensorFlow framework, whereas models with feature representation were built using the PyTorch framework with the aid of a graphics processing unit (GPU) to speed up training. All training scripts were developed in the Python programming language. Table 6 summarizes the hyperparameter settings used.
The early stopping setting (maximum epochs: 200, patience: 5) was inspired by [2]. This mechanism was used to ensure that model training ended at convergence, avoiding overfitting, by monitoring the validation loss in minimum mode. The dropout rate, number of RNN layers, and number of RNN hidden units were also inspired by the same work. Models without feature representation used the Adam optimizer with a learning rate of 0.001, whereas models with feature representation used the AdamW optimizer with a starting learning rate of 5e-5 and a decay rate of 0.01.
Initially, the starting learning rate and decay rate were 0.015 and 0.05 respectively, inspired by [1]. However, the early stopping mechanism was never triggered. The starting learning rate was then decreased to 5e-5, inspired by relevant works [3, 15, 33, 34]. Considering the small learning rate used, the decay rate was decreased to 0.01. This combination reached convergence and triggered the early stopping mechanism. Performance results were compared between the initial and final combinations. The two combinations achieved similar results, which led to the conclusion that the model oscillated around the minimum when the initial combination was used. The batch size and the maximum sequence length were tuned to cope with computational resource limitations.
Model training
Figure 3 illustrates the training flow used in this study, with the GRU used as an example. In the preprocessing stage for models without feature representation, both the training and validation sets were padded using the “[ENDPAD]” token, and OOV words in the validation set were replaced with the “[OOV]” token. Words were then mapped into integers using the mapping object created during the data preprocessing stage. For models with feature representation, a torch data loader was prepared to iterate through each set using a customized map-style dataset with additional procedures that include: (1) tokenization; (2) addition of the BERT special tokens (“[CLS]” and “[SEP]”); (3) padding (“[PAD]”); (4) initialization of the attention mask and segment ids.
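A minimal sketch of such a map-style dataset is shown below. The checkpoint name and max_len are assumptions, and the alignment of labels to sub-word tokens (plus label padding) is omitted for brevity.

```python
# Sketch of the map-style dataset preparation described above: tokenization,
# BERT special tokens, padding, attention mask, and segment ids.
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class FinancialNerDataset(Dataset):
    def __init__(self, sentences, tag_ids, max_len=128,
                 checkpoint="indobenchmark/indobert-base-p1"):
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.sentences = sentences      # list of word lists
        self.tag_ids = tag_ids          # list of integer-encoded tag lists
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.sentences[idx],
                             is_split_into_words=True,   # keep word boundaries
                             truncation=True,
                             padding="max_length",       # "[PAD]" up to max_len
                             max_length=self.max_len,
                             return_tensors="pt")
        return {
            "input_ids": enc["input_ids"].squeeze(0),            # includes [CLS]/[SEP]
            "attention_mask": enc["attention_mask"].squeeze(0),  # mask ids
            "token_type_ids": enc["token_type_ids"].squeeze(0),  # segment ids
            # NOTE: sub-word label alignment and label padding omitted here.
            "labels": torch.tensor(self.tag_ids[idx][: self.max_len]),
        }
```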
Evaluation and implementation
This study uses the F1-Score for evaluation with the help of the seqeval library, which can generate a classification report specialized for NER [35]. For demonstration purposes, this study compiled an integrated system using a Flask API with two routes as shown in Table 5. The integrated system is equipped with a WordPiece normalization procedure that transforms the BERT-tokenized tokens into human-readable form by aligning them with the original input text. Figure 4 shows a screenshot of the integrated system.
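A short example of the seqeval usage described above (the tag sequences are illustrative):

```python
# Entity-level evaluation with seqeval; inputs are lists of IOB tag sequences.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-income-type", "O", "B-income-amount", "I-income-amount"]]
y_pred = [["B-income-type", "O", "B-income-amount", "O"]]

print(f1_score(y_true, y_pred))            # overall entity-level F1
print(classification_report(y_true, y_pred))  # per-entity precision/recall/F1
```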
Results and discussion
The F1-Score results for each entity are presented in Table 7. Column 1 corresponds to the IndoBERT-BiGRU-CRF architecture, column 2 to IndoBERT-BiLSTM-CRF, column 3 to BiGRU-CRF, column 4 to BiLSTM-CRF, column 5 to BiGRU, and column 6 to the BiLSTM architecture. The average F1-Score for each model is visualized in Fig. 5. Figure 5 shows that the IndoBERT language model significantly increases the performance of the models, by an average of 49.3%. GRU-based models outperform LSTM-based models on both dataset sizes. This observation is in accordance with recent literature claiming that the GRU achieves better generalization than the LSTM on small to medium dataset sizes [36]. Even with the addition of 552 sentences in the large training set, the dataset is still considered small compared with international NER training datasets that exceed 14,000 sentences [1, 37]. The CRF layer increases the F1-Score by an average of 4.4%.
Error analysis
The best performing model is the IndoBERT-BiGRU-CRF trained using the small dataset, with an average F1-Score of 0.73. The model's performance decreased by 0.04 when trained using the large dataset. This observation contradicts the common finding in the literature that an increase in dataset size should improve a model's performance, since there are more data with which to adjust the neural network weights. Reference [38] observed that a slight increase in dataset size does not guarantee an increase in performance, since the class distribution may still not be well represented in the larger dataset. The data increase (552 sentences) in the proposed training set is not significant enough to improve the entity distribution, as reflected by the absence of a performance increase. Table 8 shows the average F1-Score for each entity, with bolded rows annotating entities that registered an increase in F1-Score. A pattern is observed between the bolded entities and the additional thread topics used in the large dataset presented in Table 2. The first two additional threads have a liability-related topic, whereas the remaining additional threads have a middle-to-high income audience. This leads to a significant increase in liability- and asset-related entities: the asset-type and liability-type entities rose by four ranks and one rank in value count in the large dataset, respectively, as shown in Table 5. The asset-amount and liability-amount entities increased in value count but remain in the same rank positions in the large dataset.
The asset-amount entity registered a decrease in F1-Score by 0.01, whereas the liability-amount entity registered a significant increase in F1-Score by 0.19. We hypothesize that this is due to the explicit characteristic that the liability-type entity has, which facilitates the model in recognizing the corresponding amount entity. Based on this observation, the classifier succeeds in recognizing the type-to-amount relationship well. The explicit characteristic of the liability-type entity is shown by the fact that 92% of liability-type entities in the large dataset contain the following words: “CC”, “Pinjam”, ”Kartu Kredit”, ”Cicil”, ”kpr”, and ”utang”. On the other hand, it requires additional context to classify asset-related entities (see Table 9).
In the following sentences, the amount entity (“5 juta”) can only be accurately recognized by attending towards preceding type entities.
1. “[INCOME-TYPE Gaji] saya [INCOME-AMOUNT 5 juta].”
2. “[EXPENSE-TYPE Pengeluaran per bulan] saya [EXPENSE-AMOUNT 5 juta].”
3. “[SAVING-TYPE Investasi per bulan] saya [SAVING-AMOUNT 5 juta].”
4. “Saya punya [ASSET-TYPE motor] harga [ASSET-AMOUNT 5 juta].”
5. “[LIABILITY-TYPE Cicilan kpr per bulan] saya [LIABILITY-AMOUNT 5 juta].”
For implicit entities, the model must grasp sufficient context from the sentence. Several example cases that require additional context are shown in Table 10. The first sentence in Table 9 shows an example that requires additional context to differentiate assets from savings. The IndoBERT-BiGRU-CRF (small) misclassifies the amount phrase (“500 juta”) as a saving-amount entity, despite correctly classifying the preceding type entity as an asset-type entity. This sentence does not imply that the user can invest 500 million into GOTO stocks periodically. Savings are assets; however, distinguishing the two is crucial to separate the total asset valuation that a user has from the periodic investment that the user can afford. The second sentence shows an example that requires context to differentiate a financial-goal from an expense. The IndoBERT-BiLSTM-CRF (large) misclassifies the sentence as an expense. The correct classification is financial-goal due to the word “pengen” in the sentence, which means “want”. If the sentence is modified to “Saya pengen beli rumah 3 tahun lagi”, both models classify the sentence accurately as a financial-goal, along with the time entity in the appended phrase.
Both models are deceived into classifying the last sentence as an income due to the word “gaji” in the sentence. The majority of “gaji” entities in the dataset are related to income; therefore, deep contextual information is required to accurately predict the last sentence as an expense. The IndoBERT-BiLSTM-CRF (large) model starts to point in the right direction by predicting the word “supir” as an expense-type. However, this leads to a false tag relationship error that violates the IOB format. Even though there is no notable increase in F1-Score when training on the large dataset, the resulting model offers new variations. Since the last two additional threads used in the large dataset have a middle-to-high income audience, new cases arise where users are paying the salaries of their employees, as shown in the first sentence in Table 10. The model trained using the large dataset accurately classifies this sentence as an expense, whereas the model trained using the small dataset still classifies the sentence as an income. A similar behavior is shown in the second sentence in Table 10, where the IndoBERT-BiGRU-CRF (small) classifies the entire sentence as an expense, while the IndoBERT-BiLSTM-CRF (large) model accurately classifies the phrase “uang makan” as an income but still misclassifies the corresponding amount entity as an expense-amount.
The majority of misclassified sentences contain punctuation in the middle of the sentence. Sentences 1 and 2 in Table 11 are similar sentences with different punctuation. Sentence 1 is accurately identified by the model. However, a different prediction is made after introducing a comma, as shown in the second sentence: the model excludes the phrase “2 M” from the financial-goal entity. This example shows that similar sentences with different punctuation may cause different predictions. This factor causes currency-related entities to have very low performance due to the numerous ways in which currencies can be represented depending on the punctuation used. This study does not include punctuation removal as part of the data cleansing process since it may change the numerical value, as shown in the following examples: (1) Rp1.000,00 \(\ne\) Rp100000; (2) 150–200rb \(\ne\) 150200rb. Sentence 3 in Table 12 also shows an example where punctuation removal between amount entities can disrupt the sentence. A possible solution is to substitute numerical values with a generic token (“[NUM]”) to remove noise. This approach does not remove context, since amount entities are recognized based on the preceding entities, not on the numerical value.
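A sketch of this masking idea is shown below; the regular expression is an illustrative assumption, not the pattern the authors would necessarily use.

```python
# Replace numeric spans with a generic "[NUM]" token before cleansing so that
# punctuation noise inside currency values cannot corrupt them.
import re

NUM_PATTERN = re.compile(r"\d[\d.,\-]*\s*(rb|jt|juta|ribu|m)?", flags=re.IGNORECASE)

def mask_numbers(text: str) -> str:
    return NUM_PATTERN.sub("[NUM]", text)

print(mask_numbers("Biaya makan saya 150-200rb per hari, target 2 M"))
# -> "Biaya makan saya [NUM] per hari, target [NUM]"
```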
Models without feature representation substitute OOV words with a generic token (“[OOV]”), which leads to information loss. In contrast, the IndoBERT language model uses WordPiece tokenization to overcome this problem. Since it is impossible to map all real numbers, handling OOV tokens is crucial in the proposed task. Table 13 shows a sample sentence that contains the following OOV words: “Byr”, “mkn”, “675”, and “hri”. These OOV words are abbreviated Indonesian words that are often used to simplify typing. The BiGRU-CRF model misclassifies the sentence, whereas the IndoBERT-BiGRU-CRF model registers accurate predictions. This shows that the IndoBERT language model is effective in addressing the OOV problem.
Table 14 presents examples of false tag relationship errors that violate the IOB format. This error occurs when an inside tag (I-) has a different entity type from the beginning tag (B-). Figure 6 visualizes the percentage of false tag relationship errors made by each model. It shows that the CRF layer succeeds in reducing the number of errors significantly. With the BiGRU-CRF and BiLSTM-CRF models trained using the large dataset, the error percentage decreases by 48% and 38% respectively; on the small dataset, the error percentage decreases by 25% and 21% respectively. This shows that a CRF layer resolves dependencies between tags by reducing the percentage of false tag relationship errors. However, it does not eliminate this error entirely.
Table 9 presents the training time for each model. The number of training epochs varies from one model to another due to the early stopping mechanism used; therefore, the average training time per iteration is used for evaluation. The IndoBERT language model used in this study generates a word embedding approximately three times the size of that used by the models without feature representation. This leads to more computations and thus a longer training time. A GPU was utilized to speed up the training process for models with IndoBERT feature representation. The training time between models with IndoBERT feature representation and models without feature representation is not comparable due to the different computational resources used during training. On average, a GRU-based model is 14% faster than the corresponding LSTM-based model. This is due to the simpler unit structure of the GRU compared to the LSTM, which reduces the number of parameters to update, leading to fewer computations and a faster training time.
Conclusion
This research proposes a financial entity extraction model to extract 15 types of personal financial information: expense-type, expense-amount, income-type, income-amount, saving-type, saving-amount, asset-type, asset-amount, liability-type, liability-amount, financial-goal, family, time, age, and occupation. The proposed task marks the first step in creating the foundation for personalization in a financial assistant chatbot system. Despite the limited training size and the large number of entities, SOTA feature-based algorithms with the IndoBERT language model register promising results on the newly built dataset. Due to the advantages that the GRU offers, this study proposes the IndoBERT-BiGRU-CRF model with the IndoBERT-BiLSTM-CRF model as a baseline. On average, the IndoBERT-BiGRU-CRF achieves better generalization and a faster training time due to its simpler unit structure. The IndoBERT-BiGRU-CRF model trained using the small dataset achieves the best performance with an average F1-Score of 0.73.
The IndoBERT language model registered an average increase of 0.23 in F1-Score due to its ability to extract deep bidirectional contextual semantic information from the input sequence. The CRF layer resolves dependencies between tags, decreasing false tag relationship errors by 7% on average. The addition of 552 sentences to the dataset does not register a notable increase in F1-Score. This is due to the diversification of personal finance topics in the large dataset and the minimal data cleansing procedures used in handling noise. Punctuation removal is not used as part of the data cleansing procedure since it can alter currency values. On the bright side, topic diversification in the large dataset introduces new variations of contextual predictions that are not present in the model trained using the smaller dataset, despite the similar F1-Score. Future directions include the addition of data and experimenting with further data cleansing procedures that can denoise punctuation without altering numerical values; a possible approach is masking numerical representations prior to cleansing. To continue building the personalization feature, the following complementary tasks are needed: (1) a timing-rate classification task to comprehend timing rates in amount entities, such as “5 juta per 5 hari”, “5 juta per minggu”, “5 juta per bulan”, etc.; (2) a relation extraction task to map relationships between entities, mainly pairing type-amount entities. These future works address the missing pieces needed to build a more complete picture of each user's personal finances. Moreover, more advanced natural language processing techniques can be implemented to extract additional important information as features.
Availability of data and materials
Not applicable.
References
Qin Q, Zhao S, Liu C. A BERT-BiGRU-CRF model for entity recognition of Chinese electronic medical records. Complexity. 2021;2021:6631837.
Khairunnisa SO, Imankulova A, Komachi M. Towards a standardized dataset on indonesian named entity recognition. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop. 2020; 64–71.
Souza F, Nogueira R, Lotufo R. Portuguese named entity recognition using BERT-CRF. arXiv. 2019. https://doi.org/10.48550/arXiv.1909.10649.
Mansouri A, Affendey LS, Mamat A. Named entity recognition approaches. Int J Comput Sci Network Secur. 2008;8(2):339–44.
Syaifudin Y, Nurwidyantoro A. Quotations identification from Indonesian online news using rule-based method. In: 2016 International Seminar on Intelligent Technology and Its Applications (ISITIA). IEEE. 2016;187–94.
Gultom Y, Wibowo WC. Automatic open domain information extraction from Indonesian text. In: 2017 International Workshop on Big Data and Information Security (IWBIS). IEEE. 2017. 23–30.
Wilie B, Vincentio K, Winata GI, Cahyawijaya S, Li X, Lim ZY, Soleman S, Mahendra R, Fung P, Bahar S, et al. IndoNLU: benchmark and resources for evaluating Indonesian natural language understanding. arXiv. 2020. https://doi.org/10.48550/arXiv.2009.05387.
OECD. Advancing the digital financial inclusion of youth. 2020. https://www.oecd.org/daf/fin/financial-education/advancing-the-digital-financial-inclusion-of-youth.htm.
Ovaska T, Sumell A. Increase interest in compound interest: economic growth and personal finance. J Econ Financ Educ. 2017;16(2):85–97.
Kamath U, Liu J, Whitaker J. Deep learning for NLP and speech recognition, vol. 84. Berlin: Springer; 2019.
Sang ETK. Text chunking by system combination. In: Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop. 2000.
Dai Z, Wang X, Ni P, Li Y, Li G, Bai X. Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records. In: 2019 12th International Congress on Image and Signal Processing, Biomedical Engineering and Informatics (CISP-BMEI). IEEE. 2019;1–5.
Wang Q, Su X. Research on named entity recognition methods in Chinese forest disease texts. Appl Sci. 2022;12(8):3885.
Zhu Y, Wang G. CAN-NER: convolutional attention network for Chinese named entity recognition. Proc NAACL-HLT. 2019. https://doi.org/10.48550/arXiv.1904.02141.
Yang J, Wang H, Tang Y, Yang F. Incorporating lexicon and character glyph and morphological features into BiLSTM-CRF for Chinese medical NER. In: 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE). IEEE. 2021;12–17.
Wintaka DC, Bijaksana MA, Asror I. Named-entity recognition on Indonesian tweets using bidirectional LSTM-CRF. Procedia Comput Sci. 2019;157:221–8.
Kusuma JF, Chowanda A. Indonesian hate speech detection using IndoBERTweet and BiLSTM on twitter. JOIV Int J Inform Visualizat. 2023;7(3):773–80.
Chowanda A, Andangsari EW, Yesmaya V, Chen T-K, Fang H-L. BERT-BiLSTM architecture to modelling depression recognition for Indonesian text from English social media. In: 2023 4th International Conference on Artificial Intelligence and Data Sciences (AiDAS). IEEE. 2023;54–59
Poostchi H, Borzeshi EZ, Piccardi M. BiLSTM-CRF for Persian named-entity recognition Armanpersonercorpus: the first entity-annotated Persian dataset. In: LREC. 2018.
Panchendrarajan R, Amaresan A. Bidirectional LSTM-CRF for named entity recognition. In: Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation. 2018.
Jie Z, Lu W. Dependency-guided LSTM-CRF for named entity recognition. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019;3862–3872.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inform Process Syst. 2017. https://doi.org/10.48550/arXiv.1706.03762.
Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2018. https://doi.org/10.48550/arXiv.1810.04805.
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv. 2016. https://doi.org/10.48550/arXiv.1609.08144.
Skansi S. Introduction to deep learning: from logical calculus to artificial intelligence. Berlin: Springer; 2018.
Cho K, Van Merriënboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: encoder-decoder approaches. arXiv. 2014. https://doi.org/10.48550/arXiv.1409.1259.
Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Workshop Deep Learn. 2014. https://doi.org/10.48550/arXiv.1412.3555.
Lee S, Ko Y. Named-entity recognition using automatic construction of training data from social media messaging apps. IEEE Access. 2020;8:222724–32.
Sun Z, Li Z, Wang H, He D, Lin Z, Deng Z. Fast structured decoding for sequence models. Adv Neural Inform Process Syst. 2019. https://doi.org/10.48550/arXiv.1910.11555.
Kingma DP, Ba JL. Adam: a method for stochastic optimization. arXiv. 2014. https://doi.org/10.48550/arXiv.1412.6980.
Loshchilov I, Hutter F. Decoupled weight decay regularization. Int Conf Learn Represent. 2018. https://doi.org/10.48550/arXiv.1711.05101.
Fachri M. Named entity recognition for Indonesian text using hidden Markov model. Yogyakarta: Universitas Gadjah Mada; 2014.
Gong Y, Mao L, Li C. Few-shot learning for named entity recognition based on BERT and two-level model fusion. Data Intell. 2021;3(4):568–77.
Vīksna R, Skadiņa I. Large language models for Latvian named entity recognition. Human Lang Technol Baltic Perspect. 2020. https://doi.org/10.3233/FAIA200603.
Nakayama H. seqeval: a python framework for sequence labeling evaluation. Software. 2018. https://github.com/chakki-works/seqeval.
Bhuvaneswari A, Thomas JTJ, Kesavan P. Embedded bi-directional GRU and LSTM learning models to predict disasters on Twitter data. Procedia Comput Sci. 2019;165:511–6.
Sang EFTK, De Meulder F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 2003;142–7.
Althnian A, AlSaeed D, Al-Baity H, Samha A, Dris AB, Alzakari N, Abou Elwafa A, Kurdi H. Impact of dataset size on classification performance: an empirical evaluation in the medical domain. Appl Sci. 2021;11(2):796.
Funding
There is no funding available for this research.
Author information
Contributions
All authors contributed equally. AC contributed to the methodology; ED proposed and tested the algorithms.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dave, E., Chowanda, A. IPerFEX-2023: Indonesian personal financial entity extraction using indoBERT-BiGRU-CRF model. J Big Data 11, 139 (2024). https://doi.org/10.1186/s40537-024-00987-6