The non-linear nature of the cost of comprehensibility

A key challenge in Artificial Intelligence (AI) has been the potential trade-off between the accuracy and comprehensibility of machine learning models, as this also relates to their safe and trusted adoption. While there has been much discussion of this trade-off, there is no systematic study that assesses to what extent it exists, how often it occurs, and for what types of datasets. Based on the analysis of 90 benchmark classification datasets, we find that this trade-off exists for most (69%) of the datasets, but that, somewhat surprisingly, it is rather small for the majority of cases and very large for only a few. Comprehensibility can be enhanced by adding yet another algorithmic step, that of surrogate modelling using so-called 'explainable' models. Such models can improve the accuracy-comprehensibility trade-off, especially in cases where the black box was initially better. Finally, we find that dataset characteristics related to the complexity required to model the dataset, and to the level of noise in it, can significantly explain this trade-off and thus the cost of comprehensibility. These insights lead to specific guidelines on how and when to apply AI algorithms when comprehensibility is required.

of models? It is often claimed that they achieve higher performance than simpler models, but is this always true? How often is this the case, and to what extent?
This trade-off between accuracy and comprehensibility is arguably one of the central debates in Artificial Intelligence (AI) 1 [4,5]. This trade-off can either limit the performance of AI, if accuracy is lost due to comprehensibility restrictions (for example, imposed by regulators) [6,7], or hurt AI adoption, if user trust is lost due to opaqueness [8]. The Apple Card example shows that companies may use black box models to achieve higher predictive performance, but at the risk of being unable to explain their AI decisions to users or regulators. However, while there has been a lot of research mentioning this trade-off, with most claiming there is one [5,8-10] and others contradicting this [11,12], there is no systematic study that assesses to what extent a trade-off indeed exists and for what types of datasets.
The goal of this paper is to provide such a systematic study. We focus on tabular datasets, as we believe that for these datasets the trade-off is less clear, and possibly smaller than expected. Deep learning models, which are models composed of multiple layers that learn representations of data at multiple levels of abstraction [13] and can thus be considered black box models, perform very well for classification on homogeneous data such as images, audio or text, but they do not necessarily outperform other machine learning techniques on tabular datasets [14][15][16].
Based on the analysis of 90 benchmark datasets across different domains, we study the nature of the differences in accuracy among a number of widely used a) opaque ("black box") models, b) comprehensible ("white box") models, and c) surrogate models used to develop a comprehensible surrogate of the opaque ones. We call the difference between (a) and (b) the "Cost of Comprehensibility", that between (a) and (c) the "Cost of Explainability", and that between (b) and (c) the "Benefit of Explaining" (Fig. 1). 2
Our main findings are: first, there is indeed a trade-off, but somewhat surprisingly it appears to be highly non-linear across datasets. Both costs are relatively small for most datasets, but very large for a few. Second, there are datasets for which the comprehensible models perform as well as or better than the black box models, supporting that one should not forgo trying comprehensible models [17]. We call these datasets "comprehensible datasets", as opposed to datasets where the black box is strictly better, which we call "opaque datasets". Understanding what makes a dataset "opaque" vs. "comprehensible", and more so, given the non-linearities observed, what makes the costs very high (positive or negative), is a challenging question, as it relates to understanding the data generation processes themselves (e.g., the "nature" of the data and the problem at hand). We discuss initial results indicating that some of the main differences between opaque and comprehensible datasets concern their inherent complexity as well as the level of noise in the data.
1 We focus on prediction models trained on data using machine learning algorithms.
The results indicate that reporting some simple characteristics of a dataset can provide clues, for example to users or regulators, about the potential accuracy-comprehensibility trade-off. To summarize, the contributions of our paper are threefold:
• A benchmark study comparing state-of-the-art white box and black box algorithms on 90 tabular datasets, and assessing their difference in performance;
• An analysis of whether surrogate modelling can improve any trade-off between comprehensibility and accuracy;
• Insights into how dataset properties can predict the nature and size of the trade-off we study.

What is comprehensibility?
Comprehensibility refers to the ability to represent a machine learning model and explain its outcomes in terms that are understandable to a human [18]. The lack of comprehensibility in black box models is one of their main pitfalls, as their inner workings are hidden from users, preventing them from verifying whether the reasoning of the system is, for example, aligned with restrictions or preferences on how decisions are made [19][20][21]. Furthermore, it is easier to debug comprehensible models or to detect bias in them, and comprehensibility also increases social acceptance [22]. In general, there are two ways to provide comprehensibility in machine learning [22,23]: intrinsic comprehensibility is obtained when using models that are comprehensible by nature due to their simple structure, the so-called "white box" models [23], while post-hoc comprehensibility aims to explain the predictions without accessing the model's inner structure [23], as provided by LIME [24], SHAP [25] or counterfactual explanations [26]. Another distinction is between global and local comprehensibility. Global comprehensibility allows one to understand the whole logic of a model and follow the reasoning that leads to every possible outcome, whereas with local comprehensibility it is possible to understand the reasons for a specific decision [22,27]. Comprehensibility is very difficult to measure due to its subjective nature. Some compare the comprehensibility of models using user-based surveys [28,29], while others rely on mathematical heuristics [9], typically the size of the model (e.g., the number of rules for a rule learner, the number of nodes for a decision tree, or the number of variables for a linear model) [30][31][32][33]. Very deep decision trees, for example, can be considered less comprehensible than a compact neural network [34]. We use the latter, heuristic approach to measure comprehensibility due to its objectivity and scalability.

What are intrinsically comprehensible models?
In line with the literature, we consider small decision trees, rule sets and linear models as comprehensible or "white box" models [8,22,27,35]. We limit the size of these models during training in order for them to be comprehensible. We opted for seven as the size limit for comprehensibility, based on cognitive load theory [36]. According to this theory, the span of absolute judgement and the span of short-term memory pose severe limitations on the amount of information that humans can receive and process correctly, with seven being the typically considered maximum in both cases [36]. We consider larger decision trees 3 , rule sets and linear models as "black box" ones. We also consider three other machine learning methods in the list of black boxes we test: neural networks, random forests and nonlinear support vector machines. It is generally agreed that these algorithms are not comprehensible, as their line of reasoning cannot be followed by human users. We base this choice of black box models on the results of benchmark studies in the literature, where these are often among the best performing ones, as can be seen in Table 1. 4 Comparing all possible models is of course infeasible, which is a practical limitation of such a study.
3 A decision tree of eight nodes is arguably not a black box model, and may be in a "grey zone" of comprehensibility. For this reason, in our experiments we focus on the very large and the small trees, rule sets and linear models, defined as those with size larger than 50 or smaller than 8 (in number of nodes/rules/coefficients), as it is a general assumption in the literature that smaller decision trees are more comprehensible than larger ones due to the cognitive size limit [9,28,37,38]. This focus ensures that our findings are applicable across applications and end users, given the arbitrariness of considering models with size between 8 and 50 as black box, which in practice depends on the application and end user.
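The size thresholds above can be captured in a small helper; a minimal sketch, where the function name and return labels are ours rather than the paper's:

```python
def comprehensibility_class(size: int) -> str:
    """Classify a trained model by its size (number of nodes, rules,
    or coefficients), following the thresholds used in this study:
    at most 7 elements is comprehensible ("white box"), more than 50
    is treated as a black box, and anything in between falls in the
    "grey zone" excluded from the main analysis."""
    if size <= 7:
        return "white box"
    if size > 50:
        return "black box"
    return "grey zone"
```

For a fitted Sklearn decision tree, for example, `size` would be the number of nodes; for a linear model, the number of coefficients.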
All the papers mentioned in Table 1 compare different machine learning models, but none investigate the difference in performance between the best black box model and the best white box model, nor whether this difference can be linked to any dataset properties. Many papers claim that black box models will always perform better, or, on the contrary, that simpler models work equally well [11,12], but a large-scale study of the difference in performance is missing. 5

Surrogate modelling
A common practice is to mimic the predictions of a black box with a global white box surrogate model, in order to improve accuracy while remaining comprehensible [50,51]. The typical process is to first build a black box model using the available training data, and then build a comprehensible model by training a white box model on the predictions of the black box instead of the original training labels. This process is called surrogate modelling [22], oracle coaching [52,53], or rule extraction in case the white box model is a decision tree or rule set [6,54]. A key metric of the quality of the surrogate model is fidelity, which measures how well the predictions of the surrogate model match those of the black box [55]. The most common goal of this kind of modelling is to use the surrogate model to explain the black box model, while still using the black box to make predictions. This of course requires that the surrogate model (1) is more comprehensible than the black box model and (2) sufficiently explains the predictions made (high fidelity). One can also use the surrogate model instead of the black box to make predictions, in order to improve the performance one could achieve using only comprehensible models. A possible reason why this approach can work better than training a white box model directly on the training data is that the black box model may filter out noise or anomalies present in the original training data [53,56]. In this case, a comprehensible model mimicking a black box may be more accurate than a comprehensible model trained on the original data, as shown in some previous work [51][52][53]. Therefore, we also investigate whether surrogate modelling can lead to better performing comprehensible models and, as such, improve the trade-off we study. Specifically, for each dataset we train a white box on the predictions of the best performing black box for that dataset.
We call this a surrogate white box model, as opposed to a comprehensible model trained on the training dataset, which we call a native white box model (see Fig. 1).

Dataset properties
Finally, we study whether there are simple (standard) properties of a dataset that may determine whether it is opaque (the best black box model outperforms the best white box) or comprehensible (the reverse happens). We use a standard toolbox, Alcobaba [57], which automatically extracts numerous characteristics ("meta-features") for any given dataset. We consider four types of dataset characteristics from this toolbox: general ones, which capture basic information such as the number of instances or the number of attributes [58]; statistical ones, which capture information about the data distribution, such as the number of outliers, variance, skewness, etc. [58]; information-theoretic ones, which capture characteristics such as the joint entropy, class entropy, class concentration, etc. [58]; and so-called complexity-related ones, which, for example in the case of a classification problem, estimate the difficulty of separating the data into their classes [59]. 6 We opt for a standard toolbox and set of dataset characteristics to make this analysis general, easily reproducible and simple to use in practice.
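To illustrate what such characteristics look like, the sketch below hand-computes two simple ones (dataset dimensions and class entropy) with NumPy. This is our own illustration of the kind of quantities the toolbox extracts, not its actual API, which provides many more meta-features automatically:

```python
import numpy as np

def basic_metafeatures(X: np.ndarray, y: np.ndarray) -> dict:
    """Compute one "general" and one "information-theoretic" meta-feature."""
    n_inst, n_attr = X.shape
    # information-theoretic: entropy of the class distribution, in bits
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    class_entropy = float(-(probs * np.log2(probs)).sum())
    return {"nr_inst": n_inst, "nr_attr": n_attr, "class_ent": class_entropy}
```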

Materials
We use a large benchmark study to compare the algorithms on different tabular datasets. Benchmark comparisons are usually developed over a few, typically standard, datasets, as a machine learning method might perform well on some datasets but not generalize to a broader range of problems [43].
To perform our experiments, we use all the binary classification datasets from the Penn Machine Learning Benchmark (PMLB) suite [43]. This dataset suite, publicly available on Github, 7 consists of both real-world and simulated benchmark datasets for evaluating supervised classification methods. It is compiled from a wide range of existing ML benchmark suites such as KEEL, Kaggle, the UCI ML repository and the meta-learning benchmark. At the time of writing, PMLB consists of 162 classification datasets and 122 regression datasets. We focus on the binary classification datasets, which amount to 90 datasets in total.
Some preprocessing was already done by the compilers of this benchmark suite. All datasets were preprocessed to follow a standard row-column format, and all categorical features and features with non-numerical encodings were replaced with numerical equivalents. All datasets with missing data were excluded, to avoid the impact of imposing a specific data imputation method. The datasets used are shown in Table 3.

Methods
Our methodology is shown in Fig. 2. For each dataset we create a training and a test set, using 75% of the data for training and 25% for testing. Both the training and the test set are scaled according to the parameters of the training set with Sklearn's MinMaxScaler. 8 This estimator scales each feature individually so that it lies between zero and one on the training set. We also use a stratified split to make sure that enough labels of each class are present for the training phase. GridSearchCV from Sklearn 9 is used with its default 5-fold cross validation to tune the hyperparameters of every model. The dataset is divided into five folds, where each time another fold is taken as the validation set. GridSearchCV then performs an exhaustive search over a specified hyperparameter grid, reported for each modelling technique in the following sections, and checks on the validation set which parameter settings performed best. By doing this five times, instead of just using one validation set, we get a more accurate representation of how the model behaves on unseen data, and we do not rely on a single validation set. We select the best hyperparameter values for each modelling technique based on this tuning. Moreover, for each dataset we also select the best surrogate model. We do this by creating a new training set, which is a copy of the original training set but whose labels are the predictions of the best black box model, selected based on the cross-validation performance. The surrogate model is trained on this relabeled training set and can be any of the original white box models, as well as Trepan or RuleFit.
6 See Supplementary Information material.
7 https://github.com/EpistasisLab/pmlb.
8 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.
9 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
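The steps above can be sketched as follows on a toy synthetic dataset, with a deliberately tiny hyperparameter grid standing in for the full grids of the individual modelling techniques:

```python
# Minimal sketch of the evaluation pipeline: 75/25 stratified split,
# train-set-only scaling, and 5-fold cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# scale both sets with parameters learned on the training set only
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# exhaustive search over a (toy) grid with the default 5-fold CV
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_leaf_nodes": [2, 4, 7]}, cv=5)
grid.fit(X_tr, y_tr)
test_accuracy = grid.score(X_te, y_te)
```

`grid.best_params_` then holds the hyperparameter values selected by the cross validation.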
The final performance of all the models (black box, white box and surrogate) is evaluated on the test set based on two metrics: accuracy and f1-score. The difference in test set performance among the different models is shown in Fig. 3. For each dataset we select the best black box, the best white box and the best surrogate, based on their performance on the test set. 10 In our aggregate analyses, we compare the test performances of these across all datasets.

Fig. 2 Methodology
10 Note that using the test data to select the best black and white boxes and then reusing the same data to compare those two across all datasets adds some bias in the results. We opt for this approach (instead of also using, for example, a validation set) as some datasets do not have many observations and we only select among a few (in total six) black boxes and among a few (in total three) white ones, making the bias small. We also verified whether our results are robust when using cross validation to select the best model and note that our results indeed hold (e.g., still for 68.89% of the datasets, the best black box model outperforms the best white box model).

Black box models
We use three state-of-the-art black box models: neural networks, random forests and nonlinear support vector machines [39,60]. As noted below, we also include in the list of black boxes the three comprehensible model types when their size after training is very large.
Random forest We use the RandomForestClassifier 11 from Sklearn and use a grid search to tune the number of trees in the forest, with values between 10 and 2000, and the number of features to consider when looking for the best split, with values ('sqrt', None).
Support vector machine We use the SVC 12 from Sklearn and use a grid search to tune the regularization hyperparameter, with values between 0.1 and 1000, and the kernel coefficient, with values between 0.0001 and 1. We use the default RBF kernel.
Neural network We use the MLPClassifier 13 from Sklearn and use a grid search to tune the size of the hidden layer. We only test neural networks with one hidden layer, with sizes between 10 and 1000.
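The three searches above can be written as GridSearchCV inputs; in this sketch, the exact grid spacings within the stated ranges are our own illustrative choices:

```python
# Illustrative black box search spaces, paired as (estimator, grid).
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

black_box_search = {
    "random_forest": (RandomForestClassifier(), {
        "n_estimators": [10, 100, 500, 2000],    # number of trees
        "max_features": ["sqrt", None],          # features per split
    }),
    "svm": (SVC(kernel="rbf"), {
        "C": [0.1, 1, 10, 100, 1000],            # regularization strength
        "gamma": [0.0001, 0.001, 0.01, 0.1, 1],  # RBF kernel coefficient
    }),
    "neural_net": (MLPClassifier(), {
        "hidden_layer_sizes": [(10,), (100,), (1000,)],  # one hidden layer
    }),
}
```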

Comprehensible models
We use three models that are generally considered to be comprehensible when their size is constrained. As discussed in the main article, we limit the size of these models to 7 (the maximum number of nodes for trees, rules for rule-based systems, and coefficients for logistic regression). We also train these models without constraining their size. In this case, when their size after training is very large, with more than 50 elements, we consider them as part of the black boxes in our analysis.
Decision tree We use the DecisionTreeClassifier 14 from Sklearn. We use a grid search to tune the function that measures the quality of a split (gini, entropy), the maximum depth (between 2 and 30), and the minimum number of samples in a leaf (2, 4). We tune the maximum number of leaf nodes between 2 and 7 for the constrained cases (white boxes) and between 2 and 1000 for the unconstrained ones (black boxes).
Logistic regression We use the LogisticRegression 15 from Sklearn. We use l2 regularization and the liblinear solver. We use a grid search to tune the regularization parameter with values between 0.0001 and 1000.
Ripper We use a rule learning algorithm based on sequential covering. This method repeatedly learns a single rule to create a rule list that covers the dataset rule by rule [22]. RIPPER (Repeated Incremental Pruning to Produce Error Reduction), introduced by Cohen in 1995, is a variant of this algorithm [61]. We use the Python implementation of Ripper hosted on Github. 16
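A sketch of the constrained white box search spaces described above. Note that Sklearn's LogisticRegression has no built-in way to cap the number of non-zero coefficients at seven, so in practice that limit would need an extra feature selection step; this detail is our assumption, not something specified here.

```python
# Illustrative white box search spaces under the cognitive size limit.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

MAX_SIZE = 7  # size limit for comprehensible ("white box") models

# constrained decision tree: at most MAX_SIZE leaf nodes
tree = DecisionTreeClassifier()
tree_grid = {
    "criterion": ["gini", "entropy"],              # split quality function
    "max_depth": list(range(2, 31)),               # maximum depth 2..30
    "min_samples_leaf": [2, 4],                    # minimum samples in a leaf
    "max_leaf_nodes": list(range(2, MAX_SIZE + 1)),
}

# logistic regression with l2 penalty and the liblinear solver
logreg = LogisticRegression(penalty="l2", solver="liblinear")
logreg_grid = {"C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
```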

Surrogate models
We use the three comprehensible models above, but this time we train them on the predictions of the best performing black box instead of on the original training labels. We also include Trepan [54], which is used for rule-extraction-based surrogate modelling, and RuleFit [62], which is based on an underlying Random Forest model. Again, we limit the size of the comprehensible models to 7.
Trepan We use the Python package Skater to implement TreeSurrogates, 17 which is based on [54]. The base estimator (oracle) can be any supervised learning model. The white box model has the form of a decision tree and is trained on the decision boundaries learned by the oracle. We use the same hyperparameter settings to tune the decision trees from Trepan as for the DecisionTreeClassifier.
RuleFit The RuleFit algorithm learns sparse linear models that include automatically detected interaction effects in the form of decision rules [62]. The interpretation is the same as for normal linear models but now some of the features are derived from decision rules. We use the Python implementation of RuleFit hosted on Github. 18
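The surrogate construction and its fidelity metric can be sketched end to end on a toy dataset, with a random forest as the black box and a seven-leaf decision tree as the surrogate:

```python
# Sketch of surrogate modelling: relabel the training set with the
# black box's predictions, fit a size-constrained white box on those
# labels, and measure fidelity (agreement with the black box) on the
# test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

black_box = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# relabeled training set: black box predictions instead of true labels
surrogate = DecisionTreeClassifier(max_leaf_nodes=7, random_state=0)
surrogate.fit(X_tr, black_box.predict(X_tr))

fidelity = np.mean(surrogate.predict(X_te) == black_box.predict(X_te))
```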

Results
First, we address the cost of comprehensibility by testing whether native white and black box models have a significant difference in performance. To assess this cost, we use both the models' f1-score and accuracy. 19 The figures for the latter are reported in Fig. 6. We first compare all the classifiers using the Friedman test 20 [63] to identify whether there are any significant differences between the different models, and then the post-hoc Nemenyi test [64] to identify significant pairwise differences. 21 The null hypothesis of the Friedman test is rejected with a p-value of 2.43 × 10^-25 (a value of the same order of magnitude when using accuracy instead of f1-scores). This means that there are significant differences among some groups of algorithms. We use the post-hoc Nemenyi test to perform all possible pairwise comparisons [65]. The results are shown in the critical difference diagram 22 in Fig. 3. The performance of the black box models (RF, MLP, SVM) is significantly better than the performance of the white box models (DT, LR, Ripper), already confirming that, overall, the cost of comprehensibility indeed exists.
16 Imoscovitz. Ripper Python package. url: https://github.com/imoscovitz/wittgenstein.
17 A. Kramer et al. Skater Python package. url: https://github.com/oracle/Skater.
18 Molnar. RuleFit Python package. url: https://github.com/christophM/rulefit.
19 We include the results with f1-score to account for imbalance issues that could bias our results.
20 https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html.
21 We cannot just use a pairwise comparison because this would inflate the probability of a type I error. The Friedman test is the non-parametric equivalent of the repeated-measures ANOVA [63].
22 These diagrams were created with the Orange Data Mining Library [66].
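The Friedman test is available in SciPy; a sketch on invented per-dataset f1-scores for three models (the post-hoc Nemenyi test would come from an additional package such as scikit-posthocs, not shown here):

```python
# Friedman test over per-dataset scores: each list holds one model's
# f1-scores on the same datasets (the numbers are purely illustrative).
from scipy.stats import friedmanchisquare

model_a = [0.90, 0.85, 0.88, 0.92, 0.81, 0.87]
model_b = [0.80, 0.79, 0.82, 0.85, 0.75, 0.80]
model_c = [0.78, 0.80, 0.79, 0.83, 0.74, 0.79]

stat, p_value = friedmanchisquare(model_a, model_b, model_c)
# a small p-value means at least one model differs significantly
```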

The cost of comprehensibility
Having established that the cost of comprehensibility exists, we study how large it is across datasets. As discussed, for each dataset we select the best black and white boxes and measure their relative difference in performance, namely the cost of comprehensibility. Figure 4a shows the results across all datasets when we order them according to this cost. This figure reveals a somewhat surprising result: the cost is highly non-linear (e.g., the plot is a sigmoid rather than being close to a straight line). For most datasets the accuracy-comprehensibility trade-off is low; only for a few it is very high (right), and for a few it is very "negative", indicating that comprehensible models largely outperform the black box ones for these datasets (left). Yet, for 68.89% of the datasets the best black box model outperforms the best white box model, reconfirming the overall existence of the cost of comprehensibility. The results for accuracy can be seen in Fig. 7a.
Fig. 4 Comparing black box and white box models. For both plots, the datasets are ordered according to the gap in f1-score between the best black box and the best native (left figure) or surrogate (right figure) white box model. The y-axis measures the relative difference in the f1-score, defined as the difference between the black and white box f1-scores divided by the f1-score of the best model
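The relative difference defined in the caption of Fig. 4 can be written down directly; a minimal sketch (the function name is ours):

```python
def relative_cost(black_f1: float, white_f1: float) -> float:
    """Relative cost of comprehensibility: the gap in f1-score between
    the best black box and the best white box, divided by the f1-score
    of the best of the two. Negative values mean the white box wins."""
    return (black_f1 - white_f1) / max(black_f1, white_f1)
```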

Can surrogate modeling improve the accuracy-comprehensibility trade-off?
We next investigate whether surrogate modelling can improve the performance of the (native) comprehensible models. For all datasets we generate the best black box and the best (native) white box trained on the training data, and then we also train a surrogate model mimicking the best black box one, what we previously called a surrogate white box. We compare the performance of these three types of models across all datasets in Fig. 5. As indicated in Fig. 5a, surrogate modelling does improve accuracy slightly relative to native white box models, on average across all datasets. We term this improvement the "Benefit of Explaining", a benefit in terms of improved predictive accuracy. Based on the Wilcoxon signed rank test 23 [63], used to compare classifiers across several datasets, we can reject the hypothesis that the native and surrogate white boxes perform equally well (p-value 0.003), with the latter performing better on average. The result for accuracy can be seen in Fig. 8. We perform the same analysis, but this time for two different types of datasets: those for which the best performing model is a black box, what we termed opaque datasets, and those for which white boxes perform at least as well as black boxes, what we called comprehensible datasets. The results are shown in Fig. 5b, c. Interestingly, the surrogate white box models outperform the native white box models on average across the opaque datasets (Wilcoxon test p-value of 7.72 × 10^-5), while the two are not significantly different for the comprehensible datasets (Wilcoxon test p-value of 0.20). In the latter case there is no need to go through a black box if its performance is not better than that of a native white box [56,67], as the latter would dominate both in terms of accuracy and comprehensibility. Hence, if one considers only opaque datasets, the use of surrogate modelling can indeed improve the accuracy-comprehensibility trade-off on average.
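The Wilcoxon signed rank test is also available in SciPy; a sketch with invented paired f1-scores for the native and surrogate white boxes:

```python
# Paired comparison of two classifiers across datasets: each position
# is one dataset's f1-score (the numbers are purely illustrative).
from scipy.stats import wilcoxon

native = [0.80, 0.78, 0.82, 0.75, 0.79, 0.81, 0.77, 0.80]
surrogate = [0.82, 0.79, 0.83, 0.78, 0.80, 0.83, 0.78, 0.82]

stat, p_value = wilcoxon(surrogate, native)
# a small p-value rejects the hypothesis that the two perform equally
```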

The cost of explainability
Next, we investigate the difference in performance between the best black box model for each dataset and the best surrogate white box model derived from that black box, what we call the cost of explainability. Fig. 4b shows the results when we sort all datasets based on this cost. The results are similar to what we observe for the cost of comprehensibility: the difference is small for most datasets, but very large for a few. The results are also in agreement with those in Fig. 7, where we see that the cost of explainability is somewhat lower than the cost of comprehensibility.

Opaque vs. comprehensible datasets
Finally, we study whether the cost of comprehensibility relates to some properties of the dataset. To do so, for each dataset we generate a number of standard dataset properties as discussed above (see also Supplementary Information material), and use them to explain the cost of comprehensibility. Specifically, we run a regression analysis using the generated dataset properties as independent variables, with the dependent variable being the difference between the performance of the best black box model and the best native white box model. We used all 90 datasets, hence the number of observations in the regression was also 90. The variables that are significant are shown in Table 2. Overall, these results indicate that properties related to the complexity required to model a dataset and its level of noise significantly explain the cost. While this is a relatively simple analysis, the results suggest that one may be able to identify or communicate whether there is a potential cost of comprehensibility by simply reporting specific dataset properties. Specifically, the following five properties are found to be significant. F1v is the directional-vector maximum Fisher's discriminant ratio, which indicates whether a linear hyperplane can separate most of the data, where lower values mean that more data can be separated this way [59]. L1 is a linearity measure that quantifies whether the classes can be linearly separated [58]; higher values indicate more complex problems, as they require a non-linear classifier [59]. These properties have a positive coefficient in the regression analysis, which means that these factors increase the gap between the best black box model and the best white box model. The sign of these coefficients is as expected: for datasets that are harder to separate linearly, the performance of black box models relative to simple models is on average better.
Two other features, EqNumAttr and NsRatio, capture information related to the minimum number of attributes necessary to represent the target attribute and the proportion of data that is irrelevant to the problem (the level of noise) [58,68]. These dataset properties have a negative relationship with the size of the cost. Note that when we analyze this result at the level of each individual prediction model, we see that these properties negatively affect the performance of both the black box models and the white box models, but more so that of the black box ones. This could be because black box models may pick up more of the noise or use many irrelevant features. Finally, N3 [59] is a neighbor-based measure that refers to the error rate of the nearest neighbor classifier. Low values of this property indicate that there is a large gap in the class boundary [69]. We see again that this property negatively affects the performance of both the black box and the white box models [69], and that its effect on the gap depends on how much it affects the performance of each model.
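The regression analysis above reduces to ordinary least squares on a 90-row design matrix; the sketch below uses synthetic meta-features and invented coefficients purely for illustration:

```python
# OLS regression: dataset properties as predictors, the performance gap
# between the best black box and the best white box as the response.
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 90
# toy meta-features standing in for, e.g., a linearity measure and a
# noise ratio (values and effect sizes are invented)
props = rng.random((n_datasets, 2))
gap = 0.3 * props[:, 0] - 0.2 * props[:, 1] + rng.normal(0, 0.01, n_datasets)

# ordinary least squares with an intercept column
design = np.column_stack([np.ones(n_datasets), props])
coef, *_ = np.linalg.lstsq(design, gap, rcond=None)
# coef[1] > 0 would mirror a complexity measure widening the gap,
# coef[2] < 0 a noise measure narrowing it
```

A statistics package such as statsmodels would additionally report the p-values used to judge significance.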

Discussion
Understanding the trade-off between comprehensibility and accuracy can have important implications for regulators as well as companies [70]. Our results indicate that most of the time the trade-off is relatively small, indicating that one should consider native white box algorithms as a key benchmark. Indeed, given the non-linearities we observe, one would expect that black boxes need be used relatively infrequently: even if they outperform white boxes for the majority of cases, our study indicates that this outperformance is typically small. Some papers in the literature also indicate that for certain datasets simple models work as well as complex ones [11,12], or that for most datasets the out-performance by black box models will be very small [71], despite the popular belief that more complex models are always better. Of course, it depends on the use case and application domain whether this small difference in performance is worth the loss in comprehensibility. Due to social and ethical pressure, insight into when one should opt for a comprehensible model could be a competitive differentiator and drive real business value [70]. Insights into this trade-off could lead to specific guidelines from regulators on how and when to apply AI algorithms when comprehensibility is required.
Table 2 The dataset properties that are significant when explaining the cost of comprehensibility, using a number of standard dataset properties as independent variables in a regression model where the cost is the dependent variable
Our results also show that using surrogate modelling could reduce the cost of comprehensibility, especially for opaque datasets. As we discussed, this may be because the intermediate black box model can filter out noise and anomalies [53,56]. We also see that simple properties of a dataset could provide insights (for example, to a third party such as a user or regulator) into the nature of the trade-off, without requiring knowledge of the algorithms tested or the data used. For example, attributes that measure how difficult it is to linearly separate the data are significantly correlated with the size of the gap. Indeed, one would expect that for these datasets black box models might be better at capturing the non-linearities. This can lead to practical tests of the feasibility of using a native white box, and of the potential accuracy loss, in a given use case.
Our general findings suggest the following guidelines:
1. Start with white box models.
2. Train additional black box models if: (a) the application allows for a (possibly small) increase in performance at a cost of comprehensibility, and (b) the level of noise is high and the data requires complex modelling, as indicated by the listed, easy-to-calculate dataset metrics.
3. If there is a practically important cost of comprehensibility (hence you are dealing with an opaque dataset), apply additional surrogate modelling algorithms.
Finally, we note that in this study we focused on tabular datasets. For other kinds of datasets, the trade-off we study may be different. For example, for image or text data, more flexible models are needed to handle the data complexity [9,13], and the difference in performance between comprehensible models and black box ones such as deep learning is often considered unbridgeable [8].

Dataset properties
For the analysis of the dataset properties, we use the metafeature toolbox of Alcobaça [57], which automatically extracts metafeatures from a dataset. The metafeatures of this toolbox are based on those described in [58]. We select the metafeatures from the groups: general, statistical, info-theory and complexity. The general metafeatures represent basic information about the dataset, capturing metrics such as the number of instances, the number of attributes, or other information about the predictive attributes [58]. The statistical measures represent information about the data distribution, such as the number of outliers, the variance, the skewness or the correlation in the data [58]. The information-theoretic measures capture the amount of information present in the data, such as the joint entropy, class entropy and class concentration [58]. The last group of measures we include is the group of complexity measures based on [59]. We do not include the clustering, landmarking or model-based metafeatures, because these already fit a model to the dataset and extract information from that model. The dataset properties used can be seen in Table 4.
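To illustrate what such metafeatures look like, a few of the measures named above can be computed by hand with numpy/scipy. This is a hand-rolled sketch on a toy dataset, not a call into the toolbox of [57]; the dataset and the particular summary statistics chosen are assumptions for the example.

```python
import numpy as np
from scipy import stats

# Toy dataset: 200 instances, 3 numeric attributes, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# General metafeatures: basic shape information.
n_instances, n_attributes = X.shape

# Statistical metafeature: mean skewness across attributes.
mean_skew = np.mean([stats.skew(X[:, j]) for j in range(n_attributes)])

# Information-theoretic metafeature: class entropy in bits
# (1.0 for a perfectly balanced binary target).
_, counts = np.unique(y, return_counts=True)
class_entropy = stats.entropy(counts, base=2)

print(n_instances, n_attributes, round(mean_skew, 3), round(class_entropy, 3))
```

The toolbox computes many more such measures per group, but each follows this pattern: a scalar summary of the data that requires no model fitting.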

Extra results on accuracy
We report the empirical results as in the main article, this time using the accuracy of the models as our metric instead of the f1-score. All results are in line with the results for the f1-score. The null hypothesis of the Friedman test is rejected with a p-value of 2.09 · 10⁻²³.
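The Friedman test used here compares the rankings of several models across datasets. A minimal sketch with `scipy.stats.friedmanchisquare` on synthetic accuracy scores (the score matrix and model offsets are invented for illustration, not the study's data):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical accuracy matrix: rows = datasets, columns = models.
rng = np.random.default_rng(1)
base = rng.uniform(0.6, 0.9, size=50)           # per-dataset difficulty
scores = np.column_stack([
    base + 0.05,                                 # black box A
    base + 0.04,                                 # black box B
    base,                                        # white box A
    base - 0.01,                                 # white box B
]) + rng.normal(0, 0.01, size=(50, 4))           # evaluation noise

# One sample per model; the test ranks models within each dataset.
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman statistic={stat:.2f}, p-value={p:.2e}")
```

A tiny p-value, as in the result above, rejects the hypothesis that all models perform equally well, after which pairwise comparisons can be examined.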
In Fig. 6, we show that the black box models are significantly better than the white box models but not significantly different from each other; the same can be said for the white box models. We see the non-linear nature of the cost of comprehensibility and explainability in Fig. 7a and b. Finally, from the boxplots in Fig. 8 we see again that, for the opaque datasets, the surrogate white box models are better on average than the native ones. We also reject the hypothesis that the native and surrogate white boxes perform equally well on average across all datasets (p-value 9.63 · 10⁻⁶). When we perform the same analysis for the two different types of datasets, we see again that the surrogate white box models outperform the native white box ones for the opaque datasets (Wilcoxon test p-value of 2.71 · 10⁻⁶), while the two are not significantly different for the comprehensible datasets (Wilcoxon test p-value of 0.53). All these results are comparable with the results obtained when using the f1-score as a metric. Finally, we also compare the dataset properties that predict whether a dataset is opaque or comprehensible, and check whether they are the same for both metrics. We see in Table 5 that the same dataset properties are important in predicting the gap in accuracy as in predicting the gap in f1-score, but that some additional attributes are now significant. F1v, L1, EqNumAttr and NsRatio were already significant in predicting the gap in f1-score.

Fig. 7 Comparing black box and white box models. For both plots, the datasets are ordered according to the gap in accuracy between the best black box and the best native (left) or surrogate (right) white box model. The y-axis measures the relative difference in accuracy, defined as the difference between the black and white box accuracy divided by the accuracy of the best model
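The relative gap defined above, together with the paired Wilcoxon signed-rank comparison of native and surrogate white boxes, can be sketched as follows. The per-dataset accuracies are synthetic and chosen to mimic the opaque-dataset pattern; they are not the study's actual numbers.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for the best black box and the
# native / surrogate white boxes on a set of "opaque" datasets.
rng = np.random.default_rng(2)
acc_native = rng.uniform(0.6, 0.85, size=40)
acc_surrogate = acc_native + rng.uniform(0.0, 0.05, size=40)  # surrogate closes part of the gap
acc_black = acc_surrogate + rng.uniform(0.0, 0.05, size=40)

# Relative gap: (black box accuracy - white box accuracy) / best accuracy.
rel_gap = (acc_black - acc_native) / np.maximum(acc_black, acc_native)

# Paired Wilcoxon signed-rank test: do surrogates outperform natives?
stat, p = wilcoxon(acc_surrogate, acc_native)
print(f"mean relative gap={rel_gap.mean():.3f}, Wilcoxon p={p:.2e}")
```

Because the datasets are paired (each model pair is evaluated on the same dataset), the signed-rank test is the appropriate non-parametric choice here.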
The linearity measures L2 and L3 are now also significant, but they have a similar meaning to L1: they quantify whether the data is linearly separable, with higher values pointing to more complex problems [59]. N4 signifies the non-linearity of the nearest neighbor classifier, and higher values are also indicative of problems of greater complexity [59]. F3 signifies the Maximum Individual Feature Efficiency, where lower values indicate simpler problems [59]. JointEnt computes the relationship of each attribute with the target variable, capturing the relative importance of the predictive attributes [58]. CanCor measures the canonical correlation between the predictive attribute and the target [58].

Fig. 8 Comparison across datasets of the best black box model for each dataset, the surrogate white box model mimicking this best black box, and the best native white box model. BB stands for black box and WB for white box. The line at 0 indicates the performance of the best black box model. The y-axis indicates the absolute difference in accuracy from the best black box model

Table 5 The dataset properties that are significant when explaining the cost of comprehensibility, using a number of standard dataset properties as independent variables in a regression model where the cost is the dependent variable