Framework for multi-criteria assessment of classification models for the purposes of credit scoring

Ziemba, Paweł; Becker, Jarosław; Becker, Aneta; Radomska-Zalas, Aleksandra

doi:10.1186/s40537-023-00768-7

Research
Open access
Published: 02 June 2023

Framework for multi-criteria assessment of classification models for the purposes of credit scoring

Paweł Ziemba¹,
Jarosław Becker²,
Aneta Becker³ &
…
Aleksandra Radomska-Zalas²

Journal of Big Data volume 10, Article number: 94 (2023) Cite this article

1950 Accesses
4 Citations
Metrics details

Abstract

The main dilemma in the case of classification tasks is to find—from among many combinations of methods, techniques and values of their parameters—such a structure of the classifier model that could achieve the best accuracy and efficiency. The aim of the article is to develop and practically verify a framework for multi-criteria evaluation of classification models for the purposes of credit scoring. The framework is based on the Multi-Criteria Decision Making (MCDM) method called PROSA (PROMETHEE for Sustainability Analysis), which brought added value to the modelling process, allowing the assessment of classifiers to include the consistency of the results obtained on the training set and the validation set, and the consistency of the classification results obtained for the data acquired in different time periods. The study considered two aggregation scenarios of TSC (Time periods, Sub-criteria, Criteria) and SCT (Sub-criteria, Criteria, Time periods), in which very similar results were obtained for the evaluation of classification models. The leading positions in the ranking were taken by borrower classification models using logistic regression and a small number of predictive variables. The obtained rankings were compared to the assessments of the expert team, which turned out to be very similar.

Introduction

The COVID-19 pandemic and related panic and restrictions have had a huge, negative impact on the global economy. The decline in potential labour income lowered consumer demand, and many business sectors either closed down or experienced financial difficulties [1]. The economic crisis caused by the pandemic is considered to be several times bigger than the global financial crisis of 2007–2009 [2]. In times of crisis, financial institutions, e.g. banks, have to limit the occurrence of risk in their activities [3]. In practice, the main types of risk that commercial banks face today are credit risk, interest rate risk and operational risk [4]. These risks are interrelated, e.g. as interest rates increase, the risk of floating interest rate loans increases. Of the above-mentioned risks, the main one is credit risk, which determines whether the borrower is able to repay the loan on time. Therefore, research on commercial banks’ credit risk is of significant theoretical and practical importance [4]. An important aspect in this context is the distinction between credit risk and the bank’s proficiency at evaluating credit risk and monitoring the loans it has made [5]. Banks use the so-called Credit Scoring Systems, which, based on the collected data about customers, conduct a credit risk analysis in order to make a final credit decision [6]. Credit risk assessment is most often performed on the basis of historical data [7] with the use of classification methods constituting the basis for the construction of classification models for the purposes of credit scoring [8].

The main dilemma in the case of classification tasks is the selection of an appropriate algorithm adapter to the problem under consideration [9]. The formalized description of the algorithm selection problem proposed in 1976 by Rice [10] takes the form of abstract 5-element models composed of performance measures and the problem space, algorithms, features and criteria. Wolpert and Macready [11] claim that there is no single algorithm that could achieve the best performance for all measures in a given problem domain. The results of classification algorithms must be carefully assessed and analysed, and this analysis must be correctly interpreted for further evaluation [12]. Empirical evaluation is the basis for verifying the potential of classification algorithms and models [13, 14].

Therefore, it seems that the proposed ranking of classification algorithms is a better approach to solving a specific classification problem than searching for one algorithm that meets all expectations [15]. According to Peng et al. [16], due to the fact that the ranking of classification algorithms requires the examination of several criteria, e.g. accuracy and precision, the choice of algorithm can be modelled as a multi-criteria decision problem. Classification models are built on the basis of classification algorithms, which are specific products of individual algorithms [17]. Accurate evaluation of classification models is one of the most important parts of the classification process [18], and the ranking of classification models, similarly to the ranking of algorithms, is also a multi-criteria problem. Multi-Criteria Decision Making (MCDM) methods are used in multi-criteria problems of evaluation and ranking of classification models.

MCDM methods are the basis for building decision models, just like classification algorithms are the basis for building classification models. In the case of MCDM methods, it was noted that decision-makers need to understand the method used [19]. Unfortunately, usually the decision maker is not an expert in the field of MCDM methods and has a limited understanding of a given method [20]. As a result, he treats a given method as a ‘black-box’, and this means that he does not trust the results of the MCDM method [21], and may even feel manipulated by the method [20]. In such a situation, it is a big challenge to increase the decision-maker’s confidence in the MCDM method used and the decisions it recommends. The way to increase trust is to align the decision-making model and the decision-maker’s mental model [22]. In addition, the combination of domain expertise and a decision model provides better and more robust decision support [23]. Decision models approximate the empirical reality, but they can also help decision makers understand the implications of their own assumptions and mental models [24]. Therefore, it is important that the decision and mental models are matched, and as a result of this matching, the expert empirical ranking should be consistent with the ranking generated by the decision model.

When it comes to the construction of the ranking of classification models, an important problem that needs to be considered in the assessment of such models is the risk of over-fitting. Over-fitting occurs when the model works well on the training set, but does not cope well with the classification of new cases, e.g. included in the validation set [25]. In practice, it is important to prevent over-fitting, so that the classification model classifies the cases in the training set and in the validation set equally well. Another important problem related to the assessment of credit scoring classification models is the fact that credit scoring prediction is carried out in a changing environment [26]. Therefore, there is a risk of degradation of the performance of the classification model (drift) over time [27]. Moreover, the temporal increase in model error may not be the only sign of its degradation. Some classification models may perform quite well “on average”, but the variability of their error values may fluctuate significantly over time [28]. Error variability degradation is a major challenge for classification models, so it is important that the model has a low variability of classification results over time.

The purpose of the research and the method of its implementation were adopted taking into account all the above-mentioned issues regarding:

Building a ranking of credit scoring classification models, including empirical evaluation and multi-criteria evaluation,
Decision-makers’ lack of trust in MCDM methods they are unfamiliar with and the need to increase this trust by matching the results of the decision-making model and the decision-maker’s mental model,
Risks of over-fitting, drift over time and degradation of the volatility of errors in the credit scoring classification model.

The aim of the research is to develop and practically verify a framework for multi-criteria assessment of classification models for the purposes of credit scoring. The framework takes into account the preferences of the analyst and the future user of the model and supports the expert in choosing the best model from among many variants of models intended for prediction of loan repayment. In this context, it is important to maintain the comparability of the obtained results for different models and to obtain a result in the form of a ranking of classification models as similar as possible to an expert empirical ranking based on a mental model. The framework is based on the MCDM method called PROSA (PROMETHEE for Sustainability Analysis) [29], thanks to which the comparability of individual classification models was ensured. Basing the framework on the PROSA method brings added value to the modelling process, allowing for the evaluation of classifiers to include (1) the consistence of the results obtained on the training set and the validation set, and (2) the consistency of the classification results for the data obtained in different time periods.

The article consists of 6 sections, the first of which is this introduction. The second section presents a review of the literature on the problems of credit scoring, assessment of classification models, including the multi-criteria assessment of classifiers. The third section, materials and methods, contains descriptions of the methods, data and methodological framework used. The fourth section presents the results of the classification models assessment, and the next section presents the discussion, in which the parameters of the assessment model were adjusted in such a way that its results were consistent with the results of the empirical assessment of experts. The article ends with the conclusions.

Literature review

When selecting algorithms for classification methods, the most common is the conventional approach, which includes, among others, knowledge from experts, trial and error method or theoretical analysis of the issues under consideration. Such proposals, according to Wang et al. [30], however, have the following disadvantages: high computational costs in the case of quite large data sets, inability to obtain knowledge about all classifiers resulting from the assessment of their representational errors, and despite the possibility of cooperation with field experts, such a solution also requires significant financial and correct relations with specialists. At the same time, Khan et al. [31] indicated that there is a noticeable increase in demand for machine learning systems that could automate the process of selecting appropriate algorithms by recommending them for various tasks. In their opinion, such systems do not have the disadvantages of conventional approaches and allow the use of machine learning algorithms to solve new problems, and also allow non-experts to operate independently.

Credit scoring predictive models

Credit risk assessment is important for financial institutions, companies and regulators. Its result is influenced, among others, by skilful risk management, identification and understanding of the factors on which it depends. On the other hand, scoring systems are important tools used to assess and monitor credit risk. Providing the most accurate risk forecast is the most important task for scoring models. The additional expectations of regulatory authorities that these models should be transparent and auditable means that simple predictive models, such as logistic regression or decision trees, are still used in practice today. Another proposed approach in the literature is the use of a wider spectrum of machine learning models, although according to Bücker et al. [32] their predictive potential is not fully exploited, leading to higher provisions or more outstanding loans. Dastile et al. [33] noted that despite the advanced applications of machine learning models in credit scoring, there are two fundamental problems: the incapability of some of the machine learning models to explain predictions and the issue of imbalanced datasets. The authors reviewed the literature describing the use of statistical approaches, machine learning and deep learning in credit scoring, identified existing limitations, leading and emerging directions in this field. According to Dastile et al. [33], the group of classifiers outperforms single classifiers, while deep learning models (e.g. convolutional neural networks) showed better results compared to other models. In the literature on credit scoring, explanatory data analysis, the role of macroeconomic variables (e.g. interest rates, unemployment and inflation) and the study of the correlation relationship between variables are often overlooked.

Among the recently published studies, the work of Trivedi [34] deserves attention, which focused on building a predictive credit scoring model taking into account German credit data. According to the author, who conducted a series of comparative analyses, the use of different feature selection techniques (such as Information-gain, Gain-Ratio and Chi-Square) and machine learning classifiers (Bayesian, Naïve Bayes, Random Forest, Decision Tree (C5.0) and SVM (support Vector Machine) contributed to improving the prediction of credit scoring. The work of Teles et al. [35] presents a comparison of research results obtained using fuzzy sets with decision trees based on artificial neural networks on credit scoring to predict the recovered value. The authors pointed out that both models allow modelling uncertainty. However, fuzzy logic is more accurate in this respect, despite the difficulties with its implementation. On the other hand, presenting the problem itself is more beneficial in the case of using a decision tree. According to Kumar and Gunjan [36], machine learning is offering immense potential in Fintech space and determining a personal credit score, and entities using deep learning and machine learning techniques have the ability to serve people who do not use the services of traditional financial institutions. The test analyses of the proposed machine learning model carried out by the authors showed that it is effective and allows for a better analysis process compared to solutions not related to machine learning.

A significant number of machine learning models have been used by Provenzano et al. [37] to create a state-of-the-art. credit scoring and default prediction system. In the presented research, the authors used the latest ML/AI concepts, starting with natural language processes (NLP) applied to (textual) descriptions of economic sectors using embedding and autoencoders (AE), followed by the classification of insolvent companies using gradient boosting machines (GBM) and calibrating their probabilities, then assigned credit ratings using differential evolution DE). The interpretability of the model was achieved by implementing techniques such as SHAP and LIME, which explain predictions locally in features’ space.

An important indicator for investors and decision-makers that should be taken into account in credit scoring work is the index of economic freedom, which enables the assessment of the degree of market openness over the degree of fiscal and regulatory restrictions. The work of Puška et al. [38] presented a multi-criteria ranking of the Balkan countries based on the criteria of economic freedom. The weight of the criteria was determined using the Entropy method, and the countries were ranged using the CRADIS method (Compromise Ranking of Alternatives from Distance to ideal Solution) using a double normalisation approach, which, according to the authors, contributed to the stability of decision-making.

According to Doumpos and Zopounidis [39], multi-criteria decision (MCDA) provides analytical methodological tools for decision support based on multiple conflicting criteria and is suitable for financial decision support. MCDA participates at all levels of the financial decision-making process. It includes the stages of problem structuring and algorithmic issues related to constructing and evaluating satisfactory solutions. Roy and Shaw [40] drew attention to the few studies on sustainability credit score systems (SCSS). The authors proposed a multi-criteria SCSS, which took into account financial and management as well as environmental and social aspects. They used a combination of the Best–Worst Method (BWM) and the fuzzy-Technique for Order Preferences by Similarity to an Ideal Solution (TOPSIS) method to create a credit scoring system. BWM was used to weight factors and fuzzy-TOPSIS was used to evaluate candidates. According to the authors, the obtained solutions will help financial institutions identify borrowers who engage in sustainable business practices. Noteworthy is the proposal of the hybrid MCDM method on the Pythagorean fuzzy-environment discussed in the work by Chaurasiya and ain [41]. According to the authors, the proposed approach can be used to identify the best software used for efficient banking management software (BMS). This method is based on the Pythagorean Fuzzy Method based on Removal Effects of Criterion (PF-MEREC) and Stepwise Weight Assessment Ratio Analysis (SWARA) approaches. The objective and subjective weights are assessed by PF-MEREC, SWARA model and the preference order ranking of the various alternatives is done through Complex Proportional Assessment (COPRAS) framework on the PFS.

Issues choosing the right model and classification algorithm

According to Kalousis and Theoharis [42], the selection of an appropriate classification model and algorithm is essential for the effective discovery of knowledge on a data set. As factors that make the selection task difficult, the authors listed many criteria of classifiers’ performance and the features of the data set affecting this performance. They proposed the use of an intelligent assistant (NOEMON), which supports the selection of an appropriate classifiers. Khan et al. [31] emphasize that classification is the key and most studied paradigm in the machine learning community. However, choosing the right classification algorithm that can be used to solve a specific problem is quite a difficult task. The mentioned dilemma is formally referred to in the literature as the algorithm selection problem (ASP). The authors’ work presents a comparative assessment of, in their opinion, all known methods of selecting classifiers, based on 17 classification algorithms and 84 sets of comparative data, as well as conclusions and recommendations. According to Brodley [43], the results of empirical comparisons of learning algorithms show that each algorithm has a selective superiority. This means that it is best for some but not all tasks. Due to the dataset, it is often impossible to say a priori which algorithm will provide the best performance. For some tasks, it is reasonable to use different classifiers, and then it is suggested to create a hybrid classifier that will include the best properties of individual algorithms. Whereas Amancio et al. [44] argue that in works on classifiers, the research focuses primarily on the performance of a given algorithm or the comparison of different classification methods. In many cases, in their opinion, researchers who are not machine learning experts struggle with practical classification tasks without adequate knowledge of the underlying parameters and use their default configuration. As a result of their experiments, the researchers noticed that there is a strong influence of the number of features on the performance of classifiers and that there are different responses of algorithms to the same set of variables. In turn, Vela et al. [28] found that the time dependence of the classification model results was practically ignored in classifier implementations. They noted that it is generally accepted that once a model has been trained to the required quality, it is ready to be deployed and used without further updating or retraining. However, data-generating environments often change over time, and their statistical properties change with them. This data evolution, known as “concept drift”, inevitably affects the quality of the models to the point where the model may no longer correspond to the new reality.

In the literature on the subject, there are many proposals and applications of classifiers in various fields. Interesting research results were published by Y. Wu et al. [45]. They assess the ability of four machine learning classifiers (i.e. multinomial logistic regression—MLR; support vector machine,—SVM; random forest—RF; gradient boosting trees—GBT) for mapping lake ice cover, water and cloud cover during both break-up and freeze-up periods using the MODIS/Terra L1B TOA (MOD02) product. Accuracy assessment using random k-fold cross-validation (k = 100) showed that all machine learning classifiers using a 7-band combination (visible, near infrared, and shortwave infrared) are able to achieve an overall classification accuracy greater than 94%. According to the authors, only RF was relatively insensitive to the choice of hyperparameters compared to the other three classifiers, demonstrating the potential of RF to map lake ice cover around the world based on the reflection data from MODIS TOA. In the publication on land-use/land-cover change (LULC), Talukdar et al. [46] presented a quantified assessment of these changes. They highlighted the need to investigate the accuracy of various LULC mapping algorithms to identify the best classifier needed to conduct further earth observations. The research involved six machine learning algorithms: random forest (RF), SVM, ANN, Fuzzy ARTMAP, SAM and the Mahalanobis distance (MD). Accuracy was assessed using the Kappa coefficient, ROC curve, index-based validation and root mean square error (RMSE). The results of the Kappa coefficient indicated that the applied classifiers had a similar level of accuracy, with the RF algorithm having the highest and, according to the authors, the best ML classifier, while the MD algorithm had the lowest accuracy. The main goal of the study by J. Roy and S. Sah [47] was to assess the vulnerability to erosion of the gorge (Hinglo river basin, an important tributary of the Ajay river—India), which combined approaches based on artificial intelligence and machine learning. A multi-layer perceptron network (MLP) was used as the base classifier, and hybrid machine learning methods, i.e. Bagging and Dagging, were used as functional classifiers. The ROC curves, mean absolute errors (MAE) and root mean square error (RMSE) were used to evaluate and compare the models. According to the authors, the integration of hybrid models with MLP increased the accuracy of the MLP models. The highest accuracy was achieved by MLP-Dagging.

The aim of the research presented by Kartal et al. [48] was to develop a hybrid methodology integrating machine learning algorithms with MCDM methods to efficiently perform multi-attribute inventory analysis. The appropriate class for each inventory item was determined on the basis of the results of the ABC (Activity Based Costing) analysis using three MCDM methods, i.e.: SAW (Simple Additive Weighting), AHP (Analytical Hierarchical Process), VIKOR (from Serbian: VIseKriterijumska Optimizacija I Kompromisno Resenje, that means: Multicriteria Optimization and Compromise Solution). In the next step, the naïve Bayesian, Bayesian network, artificial neural network (ANN) and support vector machine (SVM) algorithms were implemented to forecast the classes of predefined inventory items. Final activities focused on determining the detailed prediction performance metric of the algorithms for each method. The authors indicated that SSN and SVM are precise classifiers, both of which can be effectively applied to the issue of inventory management in a multi-criteria approach.

The efficiency of supervised classifiers was analysed in connection with the classification of biomedical data by Tuysuzoglu and Yaslan [49]. According to the researchers, the development of information technology has contributed to the improvement of storage and analysis of biomedical data sets, while machine learning methods have made a significant contribution to the evaluation and interpretation of this data. The authors obtained the optimal results of classification accuracy using SVM (Support Vector Machines) and Dictionary Learning methods, RDL (Random Feature Subspaces) and BDL (Random Instance Subspaces), which are generated using random feature/instance subspaces. Chauhan and Singh [50] proposed the use of machine learning in the diagnosis of cervical cancer to detect malignant neoplastic cells in the initial stage. They noted problems with data imbalance and non-uniform scaling across the dataset. That’s why they used Synthetic Minority Oversampling Technique along with fivefold cross-validation. The authors compared the performance of popular machine learning (ML) classifiers, such as: Naive Bayes, Logistic Regression, K-Nearest Neighbor, Support Vector Machine (SVM), Linear Discriminant Analysis, Multi-Layer Perceptron, Decision tree (DT) and Random Forest (RF) on unscaled and scaled data obtained by applying: Min–Max scaling, standard scaling and normalization. The authors proposed the best three ML algorithms in the discussed problem: RF, SVM and DT. The optimization possibilities were investigated with the methods of feature selection: univariate feature selection and recursive feature elimination (RFE). The best overall performance was obtained with the RFE random forest (RF-RFE). According to Chand et al. [51], Support Vector Machine (SVM) is one of the better classification algorithms specifically used to detect network intrusions. The authors indicated that it should be combined with other classifiers to improve performance. Research in this area has shown that the integration of SVM and random forest is an algorithm with a better classification power, especially for detecting low-frequency attacks, such as password guessing or spyware detection.

Ji et al. [52] believe that advances in machine learning have led to the increased deployment of black-box classifiers in many different applications. According to the authors, the performance of these pre-trained models should be critically and reliably assessed. Therefore, they presented an active Bayesian approach to assess the classifier performance. To this end, they performed a series of systematic empirical experiments evaluating the performance of modern neural classifiers (e.g. ResNet and BERT) on several standard image and text classification data sets. On the other hand, Gu and Jin [53] proposed an innovative partially supervised team learning algorithm called Multi-Train. It generates a number of heterogenous classifiers that use different classification models and/or different characteristics. According to the authors, the use of various input models and functions improves the performance of the presented approach compared to the existing supervised classifiers.

Overview of MCDM applications in the assessment and selection of classifiers

Many authors of publications that have appeared in recent years indicate and argue that MCDM methods are practical tools useful in the selection of machine learning (ML) classification algorithms. However, individual methods in their assessment can focus on different properties of classifiers, which results in obtaining divergent rankings. Therefore, it is often postulated to integrate several techniques, which will result in the development of a compromise, final statement. This section reviews the applications of MCDM methods for the assessment and selection of classifiers, the results are summarized in Table 1.

Table 1 Overview of the applications of the MCDM methods for the assessment and selection of classifiers

Framework for multi-criteria assessment of classification models for the purposes of credit scoring

Abstract

Introduction

Literature review

Credit scoring predictive models

Issues choosing the right model and classification algorithm

Overview of MCDM applications in the assessment and selection of classifiers

Materials and methods

Research context, decision problem and data

Methodological framework and applied PROMETHEE II and PROSA-C methods

Results

Multi-criteria decision model for the classifier evaluation

Results of classifiers assessment using the TSC aggregation

Results of classifiers assessment using the SCT aggregation

Discussion

Comparison of the PROSA solution with an expert empirical ranking

Comparison of the PROSA solution with solutions obtained using other MCDM methods

Conclusion

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Appendix 1

Appendix 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords