 Methodology
 Open access
A multidimensional hierarchical evaluation system for data quality in trustworthy AI
Journal of Big Data volume 11, Article number: 136 (2024)
Abstract
Recently, the widespread adoption of artificial intelligence (AI) has given rise to a significant trust crisis, stemming from the persistent emergence of issues in practical applications. As a crucial component of AI, data has a profound impact on the trustworthiness of AI. Nevertheless, researchers have struggled to assess data quality rationally, primarily due to the scarcity of versatile and effective evaluation methods. To address this problem, a multidimensional hierarchical evaluation system (MDHES) is proposed to estimate data quality. Initially, multiple key dimensions are devised to evaluate specific data conditions separately through the calculation of individual scores, providing a clearer understanding of the strengths and weaknesses across dimensions. Furthermore, a comprehensive evaluation method, incorporating a fuzzy evaluation model, is developed to evaluate data quality synthetically. This evaluation method achieves a dynamic balance among dimensions while harmoniously integrating subjective and objective criteria to ensure a more precise assessment result. Finally, rigorous experimental verification and comparison on both benchmark problems and real-world applications demonstrate the effectiveness of the proposed MDHES, which can accurately assess data quality and thereby provide strong data support for the development of trustworthy AI.
Introduction
With the rapid development of computer software and hardware technology, artificial intelligence (AI) has made significant breakthroughs and is increasingly applied in many fields of human production and life [1,2,3]. AI has proven particularly effective at predicting stock prices and stock trends in the financial field [4]. In the medical field, AI can assist doctors in diagnosing diseases and performing surgery [5]. Moreover, in automated driving, AI can identify real-time environmental information for path planning, thus enhancing the likelihood of early arrival of unmanned vehicles [6]. However, with the widespread application of AI, persistent issues, such as the underrepresentation of data or the unfairness of model outputs, have become an obstacle to deploying AI in real scenarios [7, 8].
To solve the aforementioned issues, trustworthy AI has emerged as a critically important scientific research direction [9]. Over the past decades, numerous investigators have concentrated their efforts on optimizing model structures or enhancing learning algorithms in order to improve the credibility of AI [10, 11]. For example, Han et al. introduced fuzzy set theory to devise a new building unit, named the fuzzy denoising autoencoder (FDA), which was used to construct a fuzzy deep network [12]. The results showed that the FDA can extract more robust features than the basic unit in a traditional DNN, mitigating the effect of uncertainties. Moreover, Rozsa et al. proposed batch adjusted network gradients (BANG) for model training, leading to improvements in model accuracy [13]. However, it is undeniable that AI has been criticized in recent years for a major weakness: the lack of interpretability in its decision-making processes [14, 15]. To address this problem, Dubey et al. devised a scalable polynomial additive model (SPAM) [16], which employs tensor rank decompositions of polynomials to create an inherently interpretable model. In addition, Meng et al. proposed a semantic-enhanced Bayesian personalized explanation ranking (SEBPER) model, leveraging interaction information and semantic information [17]. In the SEBPER model, interaction information is used to form a latent factor representation by constructing the interaction matrix; the semantic information is then adopted to optimize this factor representation, thereby improving the rationality of decision results. Additional research efforts aimed at improving neural network performance can be found in [18,19,20]. It is imperative to note that the foundation of all these methods, ranging from [12] to [20], is the trustworthiness of the data they adopt.
With the maturity of AI technology, including the emergence of automated machine learning (AutoML) platforms and industry-standard frameworks like PyTorch, it is much easier than before to develop and improve models once the data are provided [21]. Nevertheless, according to surveys conducted in [22, 23], a staggering 96% of enterprises encounter challenges related to data quality or labeling in their AI projects, while 40% of them lack confidence in ensuring data quality. If a model captures the biases and incorrect correlations of its data, it adheres to the principle of "garbage in, garbage out," which has a significant impact on the credibility of AI [24]. In fact, the breakthroughs of AI have benefited from the development of high-quality data, which means that data quality has become absolutely crucial.
In particular, as the emphasis transitions from model optimization to data improvement, data scientists often spend nearly twice as much time on data loading, cleansing and visualization as on model training, selection and deployment [25]. Andrew Ng, a respected figure in the field of AI, has emphasized that 80% data plus 20% model can equal better machine learning [26], indicating that more outstanding AI outcomes can be achieved by enhancing data quality, especially when the model remains fixed. For instance, ChatGPT, the renowned chatbot, demonstrated greater capability and credibility in its third iteration compared to its predecessor. This advancement was primarily due to extensive efforts in acquiring high-quality training data, rather than substantial modifications to the model's structure [27]. Furthermore, Professor Songchun Zhu, a globally renowned expert in the AI field, once proposed: "The key to general AI lies in establishing a 'heart' for machines. Data provides learning materials and basis for intelligent agents, and is the foundation for their mental formation" [28]. Good data can guide AI towards goodness and trustworthiness. Therefore, more scholars are participating in the evaluation and improvement of data quality. In [29], Liang et al. primarily explored the influence of the data pipeline on the credibility of AI, encompassing data design (sourcing and documentation), data sculpting (selection, cleaning and annotation), and data strategies for model testing and monitoring. Their work offers valuable theoretical insights into the significance of data quality; however, a specific implementation was not given. Additionally, a data protection strategy using backdoor watermarking was proposed to authenticate data ownership [30]. This protection method employs poison backdoor attacks for data watermarking; a hypothesis-test-guided method is then utilized for verification.
The experiments demonstrated that this protection method can effectively prevent data theft and thereby enhance the reliability of AI. Caballero proposed a 3Cs model composed of three data quality dimensions for assessing the quality of data: contextual consistency, operational consistency and temporal consistency [31]. Nevertheless, it focuses only on a single aspect of data, potentially overlooking other crucial aspects; a multidimensional assessment could provide a more holistic understanding of AI credibility. For example, the significance of data integrity and consistency as pivotal quality assurance dimensions was emphasized jointly in [32]. In addition, a comprehensive assessment methodology encompassing multiple dimensions was introduced to estimate the integrity, redundancy, accuracy, timeliness, intelligence and consistency of power data [33]. The results demonstrated that this assessment method can provide a foundation for data analysis and data mining to facilitate trustworthy AI in power systems.
However, the following problems remain to be further discussed.

1. The evaluation dimensions are currently qualitative: we understand what actions are required but lack clarity on the rationale behind them and their potential scope, and practical application guidelines remain somewhat ambiguous. In data-driven scenarios, the performance of AI models is crucial and complex. Owing to the qualitative nature of the evaluation dimensions, it is often difficult to measure accurately how these variables individually or synergistically affect the model, making it difficult to develop effective optimization strategies for improving AI models.

2. When data is input into an AI model, its performance is influenced not by a single dimension of the data but by multiple factors. The intricate interaction between the various dimensions is often overlooked or simplified in current evaluation methods, resulting in an inability to understand and evaluate the data comprehensively and deeply. Moreover, further consideration is needed of how to harmoniously integrate subjective and objective criteria to ensure a more precise assessment of data quality.
To solve these problems, a multidimensional hierarchical evaluation system (MDHES) is proposed in this paper to estimate data quality. The main contributions of this paper are as follows.

1. Multiple dimensions are designed to evaluate data conditions separately. Innovatively, not only are the evaluation dimensions given, but also specific quantitative formulas, which provide a clearer understanding of the strengths and weaknesses among these dimensions by calculating individual scores. Moreover, the effects of data improvements can be measured explicitly, offering researchers constructive guidance and feedback on implementing enhancements.

2. A comprehensive evaluation method, incorporating a fuzzy evaluation model, is developed to evaluate data quality synthetically, based on the score of each dimension. This method focuses on the interactions of dimensions to achieve a dynamic balance. Furthermore, the adoption of the fuzzy evaluation model harmoniously integrates subjective and objective criteria to reflect data quality more accurately.
The outline of this paper is organized as follows. Section "Multidimensional hierarchical evaluation system" introduces the proposed MDHES, consisting of the quantization of the evaluation dimensions and the comprehensive evaluation method, in detail. Subsequently, the experimental results and comparisons on benchmark problems are discussed to demonstrate the effectiveness of the proposed MDHES in Section "Results and discussion". Section "Practical application of multidimension comprehensive evaluation system" highlights a real-world application to cyber-telecom fraud identification, while the conclusions are given in Section "Conclusion".
Multidimensional hierarchical evaluation system
In this section, an MDHES is proposed for data quality evaluation. As shown in Fig. 1, multiple crucial dimensions (completeness, accuracy, consistency, variousness, equalization, logicality, fluctuation, uniqueness, timeliness and standard), which encompass the entire data pipeline (data processing, data usage, data storage and more), are meticulously designed to evaluate data conditions separately by calculating an individual score for each dimension. Then, a comprehensive evaluation method, integrating the individual scores, is developed to provide a synthetic evaluation of data quality. Next, the proposed MDHES is introduced in detail.
Design and calculation of dimensions for data quality evaluation
1. Completeness: Data completeness refers to the absence of any gaps or missing values within a training data set. Training data containing missing values will lead AI models to yield inaccurate assumptions, primarily because incomplete data provides only partial information, potentially causing costly mistakes and a significant waste of resources [34]. Data completeness encompasses several aspects: the comprehensiveness of features, the fullness of feature values, and the adequacy of data size. For the comprehensiveness of features, θ_{11} can be given as
where ℧_{11} is the number of features in the benchmark data, while Ω_{11} represents the number of features in the training data. Benchmark data are meticulously gathered by measurement organizations for specific scenarios, resulting in a comprehensive data set that meets specific requirements; however, acquiring such data demands considerable effort and financial investment.
The presence of null values affects the availability of data and undermines the decision-making capabilities of AI, especially when the training data contains too many null values. Therefore, it is imperative to consider the fullness of feature values, which can be defined as
where Ω_{12} signifies the number of training data samples with null values, and S indicates the total number of training data samples. Having an adequate amount of training data is also crucial, as it helps the AI model learn the underlying patterns and essential laws, thereby enhancing its generalization ability. The required data size can be described as
where v_{i} is the target sample size of the ith category in the training data, γ denotes the z-score associated with the confidence level ε, and p_{i} is the proportion of the ith category in the training data. The score of data size can be calculated as
where s_{i} indicates the actual size of ith category. Based on the above analysis, the training data completeness is
where w_{11}, w_{12} and w_{13} are the connection weights of the three aspects, respectively.
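Because the equations above are rendered as images in the original article, the following Python sketch illustrates one plausible reading of the completeness sub-scores: θ_{11} as the share of benchmark features present, θ_{12} as the share of null-free samples, and θ_{13} supplied as a pre-computed data-size score. The function name and the exact formula forms are assumptions, not the authors' published code.

```python
def completeness_score(n_features_train, n_features_benchmark,
                       n_null_samples, n_samples, size_score,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted completeness score theta_1 on a 0-100 scale (assumed form)."""
    # theta_11: comprehensiveness of features relative to the benchmark data
    theta_11 = 100.0 * n_features_train / n_features_benchmark
    # theta_12: fullness of feature values (share of samples without nulls)
    theta_12 = 100.0 * (1 - n_null_samples / n_samples)
    # theta_13: adequacy of data size, assumed pre-computed from the size formula
    theta_13 = size_score
    w11, w12, w13 = weights
    return w11 * theta_11 + w12 * theta_12 + w13 * theta_13
```

With equal weights, full feature coverage, no null values and a data-size score of 76.71, this sketch reproduces the completeness of 92.24 reported later for the KDD Cup 99 experiment.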
2. Accuracy: Data accuracy is the degree of closeness between the collected data and the actual true values [35]. Perturbed data, such as false data generated by deep-fake algorithms and adversarial samples in the dataset, may hinder the AI model from recognizing the correct patterns, ultimately affecting the accuracy of AI models. General perturbed data can be discovered by outlier detection technology, such as the 3-sigma criterion
where x_{ij} is the value in the ith row and jth column of the training data, and x̄_{j} and κ_{j} represent the average value and standard deviation of the jth column, respectively. The accuracy of the training data θ_{2} can be given as
where Ω_{21} is the number of outliers.
For the carefully designed perturbation data (false data by the deep fake algorithm, the contaminated data), regarded as adversarial sample, the robustness of data can be evaluated by the Adversarial Category Average Confidence (ACAC) [36], which can be defined as the average confidence level of the model classifier in misclassifying
where Ω_{22} represents the number of successful adversarial sample attacks, x_{adv,i} is the ith adversarial sample and y_{i} is the corresponding label value. Therefore, the accuracy of the training data θ_{2} can be given as
where Ω_{2} is the number of outliers.
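As a sketch of the 3-sigma step above, the snippet below flags entries more than three standard deviations from their column mean and scores accuracy as the share of outlier-free samples. The scoring form is an assumption, since the original equation is not reproduced in this text.

```python
import numpy as np

def three_sigma_mask(X):
    """Boolean mask: True where x_ij deviates from its column mean by > 3 sigma."""
    mean = X.mean(axis=0)   # column-wise average (x-bar)
    std = X.std(axis=0)     # column-wise standard deviation (kappa)
    return np.abs(X - mean) > 3 * std

def accuracy_score(X):
    """theta_2 on a 0-100 scale: share of samples with no 3-sigma outliers."""
    outlier_rows = three_sigma_mask(X).any(axis=1)
    return 100.0 * (1 - outlier_rows.sum() / len(X))
```

For example, in a column of 99 identical values and one extreme value, only the extreme sample is flagged, giving a score of 99.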
3. Consistency: Contradictory samples in training data can confuse AI models, preventing them from learning the correct relationships within the data set. This confusion can reduce the prediction stability of models across various scenarios and even over time [37]. Therefore, it is crucial to measure the consistency of training data. First, the format of features in the training data is verified
where θ_{31} is the score for format consistency and Ω_{31} indicates the number of samples with inconsistent formats for a feature. Moreover, by comparing the feature values and labeled values between samples, the content consistency for the same matter can be given as
where Ω_{32} denotes the number of samples containing conflicting content. In addition, after the content consistency evaluation, adversarial samples with targeted attacks will also be detected. Therefore, the consistency of the training data θ_{3} is
where w_{31} and w_{32} are the weights.
4. Variousness: When the training data lacks diversity and is predominantly homogeneous, the AI model may overfit, limiting its ability to generalize [38]. In fact, the variousness of data can be enhanced by federated learning; however, too many participants may cause computational waste and increase training costs. Therefore, to ensure the versatility of the model across different scenarios, it is essential to evaluate the breadth of data sources θ_{41} and the richness of training data categories θ_{42}
where Ω_{41} is the number of data sources of training data and ℧_{41} is the number of data sources of benchmark data. Additionally, Ω_{42} denotes the number of categories of training data, while ℧_{42} is the number of categories of benchmark data.
Therefore, the variousness of training data can be defined as
where w_{41} and w_{42} signify the connection weights.
5. Equalization: A significant discrepancy in the number of samples across different categories can result in discriminatory decision outcomes from AI models [39]. Thus, to prevent such biases, it is essential to evaluate and ensure the equalization of training data
where ζ_{max} and ζ_{min} are the maximum and minimum numbers of samples per category, respectively.
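A minimal sketch of the equalization score, assuming θ_{5} is the ratio of the smallest to the largest category size scaled to 100; the published equation is not reproduced here, so this form is inferred from the description of ζ_{max} and ζ_{min}.

```python
def equalization_score(category_counts):
    """theta_5 on a 0-100 scale: ratio of smallest to largest category size."""
    zeta_min = min(category_counts)
    zeta_max = max(category_counts)
    return 100.0 * zeta_min / zeta_max
```

For a heavily skewed data set such as the KDD Cup 99 classes, where the largest attack category dwarfs the rarest one, this ratio collapses toward the near-zero equalization score reported in the experiments.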
6. Logicality: Logicality can be utilized to evaluate whether the relationships between features in the training data align with factual or common-sense knowledge, since certain data errors may remain undetected during the data accuracy assessment [40]. Logical relationships between features encompass comparisons such as 'greater than', 'less than', 'equal to', and the like. For example, suppose the product of feature A and feature B is expected to be greater than or equal to feature C; if the product falls short of feature C, this indicates a logical inconsistency. Based on prior knowledge, logicality can be given as
where Ω_{61} represents the number of samples with logical errors.
7. Fluctuation: To investigate the periodic variation or distribution pattern of training data, fluctuation evaluation can be used to assess the difference between historical samples and the latest samples [41]. The evaluation formula for quantifying these fluctuations is expressed as
where ℧_{71} is the sum of the historical samples, Ω_{71} signifies the sum of the latest samples, and |·| denotes the absolute value operation. It is worth noting that, to ensure computational fairness, the number of historical samples should equal the number of latest samples.
8. Uniqueness: Repetitive samples influence the outputs of AI models, often causing overfitting [42]. Nevertheless, simply eliminating these samples from the training data could compromise the generalization ability of the models. Therefore, a data uniqueness assessment is required as a basis for deletion without affecting the performance of models
where Ω_{81} is the number of repetitive samples in training data.
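The uniqueness score can be sketched as the share of non-repeated samples. Counting every repetition beyond a sample's first occurrence as a duplicate is an assumption here, as the published formula is not reproduced in this text.

```python
import pandas as pd

def uniqueness_score(df):
    """theta_8 on a 0-100 scale: share of samples that are not repeats."""
    n_duplicates = int(df.duplicated().sum())  # repeats beyond first occurrence
    return 100.0 * (1 - n_duplicates / len(df))
```

Under this reading, 348,435 duplicates among the 494,021 samples of the KDD Cup 99 10% subset yield a score of about 29.5, close to the uniqueness score of 30 reported in the experiments.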
9. Timeliness: AI models need to be trained with the latest data to ensure optimal performance [43]. Failure to update the data in a timely manner may deprive the model of access to the latest information, ultimately diminishing the accuracy of its predictions or decisions. Therefore, the timeliness of data is critical to maintaining the effectiveness of the AI model. Timeliness comprises the distributional shifts of the data and the update frequency of the data. The distributional shift of the data is given as
where Ŝ_{i} is the average value of the ith feature of the dataset at time t, and τ indicates the time interval. Furthermore, the update frequency of data is
where T_{B} represents the ideal update time period and T_{N} is the time period during which the data has not been updated. Thus, the timeliness of the data can be described as
where w_{91} and w_{92} are the connection weights.
10. Standard: By standardizing data formats and naming conventions, the readability and comprehensibility of the data are significantly improved. This helps researchers rapidly become familiar with the data, reducing the complexity and time required for data processing [44]. Furthermore, data reliability can be enhanced by standardizing the data sources, collection processes, data storage and data training. The standard of the data source means that the data source channel is legitimate, while the standard of the collection process and data storage means that the data is protected by encryption techniques such as differential privacy and federated learning. The standard of data training refers to the standards followed in the process of using online or offline data to train AI models, such as homogeneous or heterogeneous computation. Based on manual judgment, the standard score θ_{10} can be given as
where θ_{10,1} is the format standard score, θ_{10,2} is the naming standard score, and θ_{10,3}, θ_{10,4}, θ_{10,5} and θ_{10,6} indicate the scores for the data source channel, collection process, data storage channel and data training, respectively; w_{10} = [w_{10,1}, w_{10,2}, …, w_{10,6}] represents the weight set.
Remark
Ten key dimensions have been elaborately designed to evaluate data quality. By calculating the score of each dimension, researchers gain a clearer understanding of the data condition. This approach not only offers valuable guidance but also provides feedback on the effectiveness of training data enhancement. Moreover, the use of simple naming conventions for dimensions, such as serial numbering, simplifies the expansion of dimensions or the addition of sub-items within each dimension in future studies. It is worth noting that, to better align with real-world scenarios, certain dimensions incorporate subjective evaluation criteria; in the following discussion, we will explore strategies to minimize the influence of subjectivity.
Comprehensive evaluation for data quality
Based on the dimensions discussed above, data quality will be comprehensively evaluated in this section. In fact, evaluating multidimensional problems is difficult, owing to the potential fuzziness and subjectivity inherent in certain dimensional values. Moreover, all dimensions have to be considered simultaneously, coupled with varying degrees of importance among them, which makes the problem more complicated. The fuzzy comprehensive evaluation model offers a solution: based on fuzzy mathematics, it can transform qualitative evaluation into quantitative evaluation to make an overall evaluation of objects constrained by multiple dimensions. Therefore, this model is introduced into the proposed hierarchical evaluation system to solve difficult-to-quantify and non-deterministic problems, better achieving a harmonious integration of subjective and objective criteria. Next, the fuzzy comprehensive evaluation model is described in detail.
Determination of evaluation dimensions and evaluation grades
According to the discussion in Section "Design and calculation of dimensions for data quality evaluation", the dimensions completeness, accuracy, consistency, variousness, equalization, logicality, fluctuation, uniqueness, timeliness and standard can be applied for data quality evaluation. However, training data in different application scenarios should be evaluated using the appropriate dimensions. Thus, the evaluation dimensions can be defined as θ = {θ_{1}, θ_{2}, θ_{3}, …, θ_{n}}, with n ∈ [1, 10]. The evaluation grades can be given as v = {v_{1}, v_{2}, v_{3}, …, v_{m}}, where m is the number of evaluation grades. For example, when m is four, the set v is {excellent, good, pass, poor}.
Construction of evaluation matrix and weight vector
Firstly, the evaluation matrix R, which contains the membership degree of each dimension in set θ for each evaluation grade in set v, needs to be determined
where r_{ij} is the membership degree of the ith dimension for the jth evaluation grade and can be obtained by adopting a membership function. It is important to highlight that distinct membership functions should be employed in different scenarios. After the evaluation matrix is calculated, the weight vector also needs to be determined.
Each dimension plays a different role and holds a different position in the evaluation of data quality. Therefore, the weight vector a = (a_{1}, a_{2}, …, a_{n}), with a_{i} ≥ 0 and ∑a_{i} = 1, needs to be designed, which expresses the importance of each dimension relative to the problem to be evaluated. Since the requirements for each dimension differ across scenarios, the analytic hierarchy process (AHP) [45] is suitable for the determination of the weights in this paper; it is an effective analytical method that harmoniously combines qualitative and quantitative approaches for solving complex problems with multiple objectives. To save space, the working principle of AHP is not presented here, while the computational steps are given in detail in Section "Results and discussion".
Make comprehensive evaluation
For the relationship matrix R and weight vector a, the fuzzy transformation result b can be described as
where the vector b = [b_{1}, b_{2}, …, b_{m}] and the symbol ∘ represents the fuzzy operator. Moreover, in order to make full use of the information in R, the weighted average operator can be adopted
Thus, the comprehensive evaluation result CER is
where η_{j} is the score of the jth evaluation grade. Moreover, the detailed evaluation processes are summarized in Table 1.
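The two steps above, the fuzzy transformation b = a ∘ R with the weighted average operator and the grade-score aggregation CER = Σ_j b_j η_j, can be sketched as follows. The function name is an assumption; the operator shown is the standard weighted-average (matrix product) form.

```python
import numpy as np

def fuzzy_evaluate(a, R, eta):
    """Fuzzy comprehensive evaluation with the weighted average operator.

    a   : weight vector over the n dimensions (non-negative, sums to 1)
    R   : n x m evaluation matrix of membership degrees
    eta : scores attached to the m evaluation grades
    """
    a = np.asarray(a, dtype=float)
    b = a @ np.asarray(R, dtype=float)             # b = a . R
    cer = float(b @ np.asarray(eta, dtype=float))  # CER = sum_j b_j * eta_j
    return b, cer
```

As a check against the numbers reported later, feeding the paper's transformation result b = [0.1, 0.1, 0.47, 0.31] through the aggregation step (via an identity R) with grade scores η = [40, 60, 80, 90] gives CER = 75.5.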
Remark
Data quality evaluation is complex because many dimensions must be considered together at the same time, and each dimension has a different level of importance. Therefore, the proposed comprehensive evaluation method, incorporating the fuzzy comprehensive evaluation model, can effectively focus on the interactions of dimensions to achieve a dynamic balance. Furthermore, expert knowledge can be employed in the evaluation, while the influence of subjectivity in the entire evaluation process is mitigated by leveraging fuzzy mathematics. Thus, this comprehensive evaluation method can accurately reflect data quality and increase the credibility of AI.
Results and discussion
To further demonstrate the procedures and effectiveness of the proposed MDHES, an example on a publicly available data set, the KDD Cup 99 intrusion detection dataset, is discussed in this section. The simulations are run on the Windows 10.0 operating system with a clock speed of 2.6 GHz and 4 GB RAM, in the Anaconda 3 development environment with the Python 3.6 programming language.
Intrusion detection scenario
Intrusion detection can discover violations of security policies or signs of attacks in networks and systems by collecting and analyzing operational log information. The KDD Cup 99 dataset was derived from a simulated US Air Force LAN with attacks lasting 7 weeks and can be obtained from https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. The 10_percent subset is adopted in this example. Each sample in the subset was labeled as normal or as one of four abnormal categories. Furthermore, anomalous data and another 17 attack types appear in the test data.
Determination of benchmark data
To match actual scenarios, benchmark data are randomly sampled from the training and testing sets at a ratio of 80%. Additionally, 50% of the data is randomly sampled from the benchmark data as a validation sample to validate the effectiveness of the AI model (a fuzzy neural network, FNN). The distribution details are summarized in Table 2.
Determination of evaluation dimensions
For the training data in the intrusion detection scenario, the evaluation dimensions used are completeness, accuracy, consistency, variousness, equalization, logicality and uniqueness. Completeness, variousness and equalization essentially guarantee the adequacy and richness of the intrusion detection data, thereby increasing the generalization and stability of the AI model. Accuracy, consistency and logicality ensure that the intrusion data learned by AI models during training is real and effective, thereby improving the accuracy of the model in identifying intrusion behaviors. Moreover, fluctuation, timeliness and standard are not evaluated in this experiment, for the following reasons:
Fluctuation: The intrusion detection data does not exhibit periodic change; thus, the fluctuation assessment cannot be performed.
Timeliness: Although this training data is from 1999, it provides a foundation for research on network intrusion detection based on computational intelligence. According to the requirements of timeliness, however, the evaluation scores would be very low. Therefore, the timeliness evaluation is omitted in this experiment to avoid distorting the final evaluation results.
Standard: The standard of the training data cannot be assessed, because the data acquisition process and storage channel are unknown.
Evaluation results and analysis for intrusion detection
Dimension evaluation for training data: Based on completeness, accuracy, consistency, variousness, equalization, logicality and uniqueness, the KDD Cup 99 intrusion detection data was evaluated with equal weights for each evaluation dimension; the details are summarized in Table 3. For the completeness of the training data, the scores of features and feature values are obviously 100. The score of data size is 76.71 at the confidence level of 4%, so the completeness is 92.24. In addition, the accuracy is 96, where 18,612 anomalous data points are filtered out according to Eq. (6). No contradiction was discovered between the training data and benchmark data by comparing each sample, so the score of consistency is 100. Because the benchmark data are randomly selected from the training and testing sets, the multi-source score of the training data is 50. Moreover, there are 23 categories in the training data and 39 categories in the benchmark data, so the category richness is 56; therefore, the variousness score is 53. Intuitively, there is a significant difference in the number of samples per category of the training data, and the equalization score is only 1 according to Eq. (13). The logicality score is determined to be 100 by calibration. Additionally, there are 348,435 duplicate entries in the training data, and the score for uniqueness is calculated to be 30.
Comprehensive evaluation for training data: Based on the above discussion of each dimension, data quality can be comprehensively evaluated in this section. In our experiment, data quality is divided into four evaluation grades: poor, pass, good and excellent. The evaluation matrix R, containing the membership degree of each dimension for each evaluation grade, needs to be calculated. First, the membership functions are determined. For the "poor" grade, the lower the score of a dimension, the greater its degree of membership; thus, "poor" should be a gradually decreasing function. Similarly, "excellent" is a gradually rising function, while both "pass" and "good" are functions that rise and then fall. Based on the above analysis, the trapezoidal membership function, characterized by simple calculation, is adopted. The distribution can be given as
where δ_{1}, δ_{2}, δ_{3}, and δ_{4} are the critical values at the four levels of poor, pass, good, excellent, respectively.
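Since the published membership-function formulas and the critical values of Table 4 are not reproduced in this text, the sketch below implements generic piecewise-linear memberships of the shapes described above ("poor" falling, "pass" and "good" rising then falling, "excellent" rising); the breakpoints used are illustrative assumptions.

```python
def trapezoid_memberships(score, d1, d2, d3, d4):
    """Membership degrees of a score in the grades [poor, pass, good, excellent]."""
    def falling(x, lo, hi):
        # 1 below lo, linear descent on [lo, hi], 0 above hi
        if x <= lo:
            return 1.0
        if x >= hi:
            return 0.0
        return (hi - x) / (hi - lo)

    def rising(x, lo, hi):
        return 1.0 - falling(x, lo, hi)

    poor = falling(score, d1, d2)
    pass_ = min(rising(score, d1, d2), falling(score, d2, d3))
    good = min(rising(score, d2, d3), falling(score, d3, d4))
    excellent = rising(score, d3, d4)
    return [poor, pass_, good, excellent]
```

For a score lying between two adjacent critical values, only the two neighboring grades receive non-zero membership and the degrees sum to one, which mirrors the form of the fuzzy set [0, 0.276, 0.724, 0] obtained below for completeness.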
For the intrusion detection scenario, completeness, accuracy, consistency and logicality are critical, as they have a direct impact on the FNN model. However, equalization is hard to guarantee, owing to the difficulty and cost of attacks. Uniqueness is also difficult to score satisfactorily, because attack methods are often repeated in this scenario. Therefore, the grade distribution of each dimension is as follows.
According to Table 4, the membership functions of the dimensions can be determined. For example, the membership functions of completeness for the four grades are
thus, the fuzzy set of completeness is [0, 0.276, 0.724, 0]. Sequentially, the evaluation matrix R can be given as
Next, the weight vector a is determined by the analytic hierarchy process (AHP). First, the judgment matrix C is constructed, and its elements c_{ij} are assigned using the 1–9 scale method proposed by Saaty.
Based on Table 5, the judgment matrix C, obtained by pairwise comparison of completeness, accuracy, consistency, variousness, equalization, logicality and repeatability, can be defined as
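Since the 7 × 7 matrix C itself is given in the figure rather than the text, the weight-extraction step can be sketched on a small illustrative reciprocal matrix on the Saaty 1–9 scale; power iteration recovers the principal eigenvector, which is normalized into the weight vector:

```python
# Illustrative AHP sketch (not the paper's matrix C): power iteration on a
# 3x3 reciprocal judgment matrix yields the principal eigenvector, and a
# Rayleigh-style average estimates the principal eigenvalue lambda_max.
def ahp_weights(c, iters=100):
    n = len(c)
    w = [1.0 / n] * n
    for _ in range(iters):
        w_new = [sum(c[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w_new)
        w = [x / total for x in w_new]  # renormalize each iteration
    cw = [sum(c[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(cw[i] / w[i] for i in range(n)) / n  # estimate of lambda_max
    return w, lam

# Example: dimension 1 moderately preferred over 2, strongly over 3.
c = [[1,   3,   5],
     [1/3, 1,   3],
     [1/5, 1/3, 1]]
weights, lam_max = ahp_weights(c)
print([round(x, 3) for x in weights], round(lam_max, 3))
```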
where CR is the consistency ratio, λ_{max} is the absolute value of the largest eigenvalue of the matrix C, and τ is the order of the matrix. For matrix C, λ_{max} is 7.588 and τ is 7. For a matrix of order 7, the corresponding random index RI is 1.32, as tabulated by Saaty. Therefore, CR is 0.074, and C passes the consistency test. The eigenvector corresponding to the maximum eigenvalue λ_{max} is [– 0.15, – 0.43, – 0.69, – 0.17, – 0.04, – 0.53, – 0.048]; after normalization, the weight vector is [0.18, 0.25, 0.25, 0.04, 0.04, 0.12, 0.12]. The fuzzy transformation result is therefore [0.1, 0.1, 0.47, 0.31]. With the scores [40, 60, 80, 90] assigned to the four evaluation grades, the comprehensive evaluation result for the intrusion detection training data is 75.5.
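The consistency check and the final defuzzification use only the quantities reported in the text, so both can be verified directly; the following is a minimal sketch assuming the standard formula CR = CI / RI with CI = (λ_max − τ)/(τ − 1):

```python
# Consistency check and defuzzification using the values reported in the text:
# lambda_max = 7.588, order tau = 7, Saaty's RI = 1.32 for order 7, the fuzzy
# transformation result b, and the grade scores s = [40, 60, 80, 90].
def consistency_ratio(lambda_max, tau, ri):
    """CR = CI / RI with CI = (lambda_max - tau) / (tau - 1)."""
    ci = (lambda_max - tau) / (tau - 1)
    return ci / ri

print(round(consistency_ratio(7.588, 7, 1.32), 3))  # 0.074, below the 0.1 threshold

def defuzzify(b, s):
    """Weighted sum of grade scores by the fuzzy transformation result."""
    return sum(bi * si for bi, si in zip(b, s))

score = defuzzify([0.1, 0.1, 0.47, 0.31], [40, 60, 80, 90])
print(round(score, 1))  # 75.5, the comprehensive evaluation result
```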
In addition, to highlight the advantages of the fuzzy comprehensive evaluation method, it is compared with a weighted average evaluation in which each dimension is equally weighted. The equally weighted score for data quality is 71.2, about 4 points lower than the fuzzy comprehensive evaluation. The fuzzy comprehensive evaluation method produces its final result by assigning unequal weights to the evaluation dimensions; this weighting reflects the relative importance of the dimensions and makes the result more scientific and reasonable. It can also handle vague and uncertain information effectively. These advantages make the fuzzy comprehensive evaluation method one of the effective means of solving practical evaluation problems.
Based on the above discussion, equalization and uniqueness are the main causes of the unsatisfactory data quality score. Therefore, some data processing is required to improve data quality. After deduplication and simple random sampling [46], the distributions of all data sets are shown in Figs. 2 and 3. To improve equalization, the number of samples in category R2L is decreased while the other categories are increased. The evaluation results for the improved training data are summarized in Table 6, and the comparison with the raw training data is displayed in Fig. 4.
Based on Fig. 4 and Table 6, the accuracy is improved by 2% compared with the raw training data, and the scores of variousness, equalization and uniqueness are improved by at least 50%. It is worth noting that duplicates are not completely removed, in order to maintain a balance between equalization and uniqueness under the limitations of the benchmark database. The comprehensive score of the processed training data is 89.2, a significant improvement in quality.
To further demonstrate the impact of the training data improvement on model performance, the FNN is applied to the validation data. The simulation results of the FNN on the processed training data (FNNP) are compared with the FNN on the raw training data (FNNR), the multi-level hybrid support vector machine and extreme learning machine based on modified k-means (SVM + ELM + MK) [47], the back propagation (BP) network, the improved long short-term memory (ILSTM) [48], and the improved sparse denoising autoencoder (ISSDA) [49]. For a fair comparison, the accuracy T_{A}, false positive rate T_{F}, and false negative rate T_{M} are introduced. The results are displayed in Figs. 5, 6 and 7, and the details are given in Table 7.
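Assuming the standard confusion-matrix definitions of these three metrics (the text does not spell them out), they can be computed as in the sketch below; the counts used are illustrative, not from the paper:

```python
# Hypothetical helper for the three comparison metrics, assuming the standard
# definitions: accuracy T_A = (TP + TN) / total, false positive rate
# T_F = FP / (FP + TN), false negative (missed-detection) rate T_M = FN / (FN + TP).
def detection_metrics(tp, fp, tn, fn):
    t_a = (tp + tn) / (tp + fp + tn + fn)  # accuracy over all samples
    t_f = fp / (fp + tn)                   # share of normal traffic flagged
    t_m = fn / (fn + tp)                   # share of attacks missed
    return t_a, t_f, t_m

# Illustrative counts for one attack category:
t_a, t_f, t_m = detection_metrics(tp=900, fp=50, tn=950, fn=100)
print(t_a, t_f, t_m)  # 0.925 0.05 0.1
```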
The accuracy of the different AI models is displayed in Fig. 5, and the false positive and false negative rates are shown in Figs. 6 and 7. Combined with Table 7, it can be seen that FNNP achieves a significant improvement in all categories (its performance is shown in bold in Table 7): for example, the accuracy on 'Normal' is improved by approximately 8% and the false rate on 'Dos' is reduced to 3.6% compared with FNNR. This indicates that data quality plays a crucial role in improving model performance. Moreover, FNNP outperforms the meticulously designed SVM + ELM + MK, especially in the 'U2R' and 'R2L' categories, and its effectiveness is comparable to that of the improved deep networks (ILSTM and ISSDA). Based on the above analysis, the proposed MDHES is of considerable significance. The strengths and weaknesses of the intrusion detection data can be understood more clearly by quantifying the score of each dimension, which provides researchers with guidance and feedback on data quality improvement. Furthermore, the proposed comprehensive evaluation method achieves a dynamic balance among dimensions, avoiding the waste of resources on excessively pursuing the improvement of a single dimension: even though the equalization score is only 58, the FNN still achieves good performance.
Practical application of the multidimensional comprehensive evaluation system
In recent years, cyber-telecom fraud has become a serious social problem that threatens people's property, characterized by frequent occurrence, rapid growth, and resurgence despite repeated prohibition. Deep neural networks (DNNs) are a powerful tool for effectively distinguishing fraudulent behaviour. However, this tool is highly questionable, because its effects cannot be determined in advance, which may lead to unbearable consequences. Therefore, the proposed MDHES is applied from the data quality perspective to the intelligent identification of cyber-telecom fraud, in order to enhance the credibility of DNNs.
Fraud data from four provincial companies in China were obtained and, after data processing (deduplication, denoising, completion of missing values, etc.), form a benchmark database, which is encrypted with differential privacy and uploaded to the data management library of the proposed evaluation system. The data features fall into three categories: network management data, signalling data, and call ticket data. The network management data contain the IP address, MAC address, port number and so on. The signalling data are location-related information. The features of the call ticket data include the number of calls, total call duration, maximum call duration, minimum call duration and last call duration. After selection, 32 important features remain in the benchmark database, some of which are highly sensitive and must not be disclosed publicly. The database contains 500 million records in total: approximately 400 million normal samples and 100 million fraud samples. There are two main types of fraud: traditional fraud (gambling, pornography, pyramid schemes) and new fraud (investment and financial management, pig-butchering, loans, order brushing, impersonation of industry and commerce authorities, impersonation of public security, procuratorial and judicial authorities, impersonation of customer service, and so on).
In this scenario, nine dimensions are adopted to evaluate the training data; fluctuation is excluded because fraud activities occur irregularly. First, the training data file with 40 million samples is uploaded to the proposed MDHES (see Fig. 8). Clicking the 'Create' button creates an evaluation task; the nine dimensions are then selected, and the task is started by clicking the 'Start' button. The evaluation results are shown in Fig. 9. The data size score within completeness is 58.16 and the variousness score is 75. The dimensions completeness, consistency, uniqueness and timeliness have unsatisfactory scores and need to be improved, after which the evaluation process is repeated. To clearly demonstrate the changes in the scores of the raw and improved data samples, the comparisons are displayed in Fig. 10, and the details are summarized in Table 8.
As shown in Fig. 10 and Table 8, the completeness of the training data is improved by 10%. Consistency and timeliness are two important dimensions for fraud identification DNNs, because the objects themselves change over time: for example, a previously normal URL may become a malicious URL, leading to a contradiction in label values. Therefore, consistency is enhanced by correcting contradictory samples based on the latest fraud data, and timeliness is increased as well. The comprehensive data quality score is 92.6, an improvement of approximately 8.3% over the original training data. To reveal the impact of this data quality change on DNNs, SDA and LSTM models are adopted, where the number of basic building units of the SDA is 32 and that of the LSTM is 21. Their performance on the validation data is shown in Fig. 11.
As shown in Fig. 11, the average accuracy, average false positive rate, and average false negative rate of the DNNs all improve. The average accuracy of the DNNs is 89.46, an increase of 13% compared to the original data. Furthermore, the average T_{F} is reduced to 8.19% and the average T_{M} to 2.61%. Based on the evaluation results and performance, this intelligent identification model meets the actual application requirements and has been deployed in cyber-telecom fraud prevention. The actual application is shown in Fig. 12.
The display shows the total number of fraudulent websites today, the total number of fraudulent websites within a week, the number of interceptions, and the total number of interceptions within a week. Moreover, the top twelve types of fraudulent websites are given, and the most frequently blocked websites are also displayed. In the first quarter of 2024, the cyber-telecom fraud prevention system identified more than 3 million fraudulent websites and blocked more than 46 billion illegal visits, effectively reducing the incidence of fraud and protecting citizens' lives and property.
Conclusion
In this paper, a MDHES is proposed to evaluate data quality. In this evaluation system, multiple crucial dimensions are designed to evaluate data conditions separately, providing a clearer understanding for improvement. A comprehensive evaluation method incorporating a fuzzy evaluation model is then developed to evaluate data quality synthetically, achieving a dynamic balance while mitigating the impact of subjectivity on the comprehensive evaluation result. Finally, experimental results and comparisons, on an intrusion detection benchmark problem and a real intelligent identification application for cyber-telecom fraud, demonstrate the effectiveness of the proposed MDHES, which achieves an accurate and thorough data quality assessment and provides strong data support for trustworthy AI. However, when there are many dimensions (more than 9), the workload of AHP scaling becomes too large, which can easily cause confusion in judgment. Therefore, in future research, on the one hand, we will focus on improving the comprehensive evaluation method to assess data quality better; on the other hand, more dimensions and sub-items for large models will be considered to improve our evaluation system, promoting the safe and efficient application of AI in more practical scenarios.
Data availability
The KDDCup99 dataset was derived from a simulated US Air Force LAN subjected to seven weeks of attacks and can be obtained at https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. The fraud dataset is owned by a third party. The data underlying this paper were provided by [third party] under licence/by permission, and will be shared on request to the corresponding author with the permission of [third party].
Abbreviations
AI: Artificial intelligence
MDHES: Multidimensional hierarchical evaluation system
FDA: Fuzzy denoising autoencoder
BANG: Batch adjusted network gradients
SPAM: Scalable polynomial additive model
SEBPER: Semantic-enhanced Bayesian personalized explanation ranking
AutoML: Automated machine learning
FNN: Fuzzy neural network
AHP: Analytic hierarchy process
FNNP: FNN trained on the processed training data
FNNR: FNN trained on the raw training data
SVM + ELM + MK: Support vector machine and extreme learning machine based on modified k-means
BP: Back propagation
ILSTM: Improved long short-term memory
ISSDA: Improved sparse denoising autoencoder
References
Ahmed I, Jeon G, Piccialli F. From artificial intelligence to explainable artificial intelligence in industry 4.0: a survey on what, how, and where. IEEE Trans Ind Inform. 2022;18(8):5031–42.
Putra MA, Ahmad T, Hostiadi DP. BCAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows. J Big Data. 2024;11(1):49.
Zhang HJ, He S, Chen J. A hierarchical authentication system for access equipment in internet of things. Int J Intell Syst. 2023;1:1–11.
Furman J, Seamans R. AI and the economy. Innov Policy Econ. 2019;19(1):1–191.
Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2:719–31.
Khan A. Role of artificial intelligence in carfollowing and lane change models for autonomous driving. Adv Hum Asp Transp. 2018;9:307–17.
Schetinin V, Li D, Maple C. An evolutionarybased approach to learning multiple decision models from underrepresented data. In: Schetinin V, editor. 2008 Fourth international conference on natural computation, vol. 1. Jinan: IEEE; 2008.
Vavra P, Baar JV, Sanfey A. The neural basis of fairness. Interdiscip Perspect Fairness Equity Justice. 2017;5:9–31.
Liu HC, Wang YQ, Fan WQ, et al. Trustworthy AI: a computational perspective. ACM Trans Intell Syst Technol. 2022;14(1):1–59.
Chatila R, Dignum V, Fisher M, et al. Trustworthy AI. Reflect Artif Intell Hum. 2021;12600:13–39.
Malchiodi D, Raimondi D, Fumagalli G, et al. The role of classifiers and data complexity in learned Bloom filters: insights and recommendations. J Big Data. 2024;11(45):1–26.
Han HG, Zhang HJ, Qiao JF. Robust deep neural network using fuzzy denoising autoencoder. Int J Fuzzy Syst. 2020;22(6):1356–75.
Rozsa A, Gunther M, Boult TE. Towards robust deep neural networks with BANG. In: IEEE winter conference on applications of computer vision (WACV)2018.
Yampolskiy R. Unexplainability and Incomprehensibility of AI. J Artif Intell Conscious. 2020;7(2):1–15.
Guidotti R, Monreale A, Ruggieri S, et al. A survey of methods for explaining black box models. ACM Comput Surv. 2019;51(5):1–42.
Dubey A, Radenovic F, Mahajan D. Scalable interpretability via polynomials. Neural Inform Process Syst. 2022;1:1–26.
Meng ZL, Wang MH, Bai JJ, et al. Interpreting deep learningbased networking systems. IEEE Commun Surv Tutor. 2019;21(3):2702–33.
Nwafor O, Okafor E, Aboushady AA, et al. Explainable artificial intelligence for prediction of nontechnical losses in electricity distribution networks. IEEE Access. 2023;11:73104–15.
McClure P, Moraczewski D, Lam KC, et al. Improving the interpretability of fMRI decoding using deep neural networks and adversarial robustness. Apert Neuro. 2023;3:1–17.
Fernandes FE, Yen GG. Automatic searching and pruning of deep neural networks for medical imaging diagnostic. IEEE Trans Neural Netw Learn Syst. 2021;32(12):5664–74.
Barreiro E, Munteanu CR, Monteagudo MC, et al. Net–net auto machine learning (AutoML) prediction of complex ecosystems. Sci Rep. 2018;8(12340):2685–96.
Elliott A. What data scientists tell us about AI model training today, Alegion, 2019; 1–10.
Forrester Consulting. Overcome obstacles to get to AI at scale. IBM. 2020; 1–12.
Kortylewski A. Analyzing and reducing the damage of dataset bias to face recognition with synthetic data. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019; pp. 2261–2268.
Jackson A, The state of open data science 2020, Digital Science. 2020: 1–30.
Ng A. A chat with Andrew on MLOps: from model-centric to data-centric AI. 2022. https://cloud.google.com/solutions/machinelearning/mlopscontinuousdeliveryandautomationpipelinesinmachinelearning.
Zhang D, Lai H. Data-centric artificial intelligence: a survey. 2023; 1–39.
Zhu SC. Making mathematical models for the humanities: Chinese thought from the perspective of artificial general intelligence. J Mod Stud. 2024;3(1):42–66.
Liang WX, Tadesse GA, Ho D, et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat Mach Intell. 2022;4:669–77.
Artamonov I, Deniskina A, Filatov V, et al. Quality management assurance using data integrity model. Matec Web Conf. 2019. https://doi.org/10.1051/matecconf/201926507031.
Caballero I, Serrano M, Piattini M. A data quality in use model for big data. Adv Concept Model. 2014;8823:65.
Cai L, Zhu YY. The challenges of data quality and data quality assessment in the big data era. Data Sci J. 2015;14(2):78–92.
Hongxun T, Honggang W, Kun Z. Data quality assessment for online monitoring and measuring system of power quality based on big data and data provenance theory. In: Hongxun T, editor. 2018 IEEE 3rd international conference on cloud computing and big data analysis. Chengdu: IEEE; 2018. p. 248–52.
Cai L, Zhu YY. The challenges of data quality and data quality assessment in the big data era. Data Sci J. 2015;14:69–87.
Barchard KA, Verenikina Y. Improving data accuracy: selecting the best data checking technique. Comput Hum Behav. 2013;29(5):1917–22.
Li ZT, Sun JB, Yang KW, Xiong DH. A review of adversarial robustness evaluation for image classification. J Comput Res Dev. 2022;59(10):2164–89.
Khalfi B, de Runz C, Faiz S, Akdag H. A new methodology for storing consistent fuzzy geospatial data in big data environment. IEEE Trans Big Data. 2021;7(2):468–82.
Wang S, Yao X. Relationships between diversity of classification ensembles and singleclass performance measures. IEEE Trans Knowl Data Eng. 2013;25(1):206–19.
Chae JH, Jeong YU, Kim S. Datadependent selection of amplitude and phase equalization in a quarterrate transmitter for memory interfaces. IEEE Trans Circuits Syst. 2020;67(9):2972–83.
Yao W. Research on static software defect prediction algorithm based on big data technology. In: Yao W, editor. 2020 International conference on virtual reality and intelligent systems (ICVRIS). Zhangjiajie: IEEE; 2020. p. 610–3.
Kim KY, Park BG. Effect of random dopant fluctuation on data retention time distribution in DRAM. IEEE Trans Electron Devices. 2021;68(11):5572–7.
Widad E, Saida E, Gahi Y. Quality anomaly detection using predictive techniques: an extensive big data quality framework for reliable data analysis. IEEE Access. 2023;11:103306–18.
Xia Q, Xu Z, Liang W, Yu S, et al. Efficient data placement and replication for QoSaware approximate query evaluation of big data analytics. IEEE Trans Parallel Distrib Syst. 2019;30(12):2677–91.
Lee D. Big data quality assurance through data traceability: a case study of the national standard reference data program of Korea. IEEE Access. 2019;7:36294–9.
Ge Z, Liu Y. Analytic hierarchy process based fuzzy decision fusion system for model prioritization and process monitoring application. IEEE Trans Industr Inf. 2019;15(1):357–65.
Antal E, Tillé Y. Simple random sampling with overreplacement. J Stat Plann Inference. 2011;141(1):597–601.
Al-Yaseen WL, Othman ZA, Nazri MZA. Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system. Expert Syst Appl. 2017;67:296–303.
Zhang L, Yan H, Zhu Q. An improved LSTM network intrusion detection method. In: Zhang L, editor. 2020 IEEE 6th international conference on computer and communications (ICCC). Chengdu: IEEE; 2020.
Guo XD, Li XM, Jing RX, et al. Intrusion detection based on improved sparse denoising autoencoder. J Comput Appl. 2019;39(3):769–73.
Acknowledgements
Zhang Yusheng is acknowledged for the consulting assistance and quiet companionship he provided to the first author of this manuscript.
Funding
This work was supported by "Research and Standardization of Key Technologies for 6G General Computing and Intelligent Integration" under Grants R241149BC03 and R241149B, and "Research on 6G Trusted Endogenous Security Architecture and Key Technologies" under Grant R24113V7.
Author information
Authors and Affiliations
Contributions
H.J. is the main designer of the proposed evaluation system and was also a major contributor to writing the manuscript. C.C., P.R. and K.Y. coordinated with other parties to obtain and analyze the fraud data. Z.Y., J.C., Q.C. and J.K. were mainly responsible for the web development of the multidimensional comprehensive evaluation system and the real-time display screen for cyber-telecom fraud prevention. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/byncnd/4.0/.
About this article
Cite this article
Zhang, HJ., Chen, CC., Ran, P. et al. A multidimensional hierarchical evaluation system for data quality in trustworthy AI. J Big Data 11, 136 (2024). https://doi.org/10.1186/s40537-024-00999-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537-024-00999-2