
A multi-dimensional hierarchical evaluation system for data quality in trustworthy AI

Abstract

Recently, the widespread adoption of artificial intelligence (AI) has given rise to a significant trust crisis, stemming from the persistent emergence of issues in practical applications. As a crucial component of AI, data has a profound impact on the trustworthiness of AI. Nevertheless, researchers have struggled to assess data quality rationally, primarily owing to the scarcity of versatile and effective evaluation methods. To address this problem, a multi-dimensional hierarchical evaluation system (MDHES) is proposed to estimate data quality. Initially, multiple key dimensions are devised to evaluate specific data conditions separately through the calculation of individual scores, providing a clearer understanding of the strengths and weaknesses among the various dimensions. Furthermore, a comprehensive evaluation method, incorporating a fuzzy evaluation model, is developed to synthetically evaluate data quality. This method achieves a dynamic balance among dimensions and a harmonious integration of subjective and objective criteria to ensure a more precise assessment result. Finally, rigorous experimental verification and comparison on both benchmark problems and real-world applications demonstrate the effectiveness of the proposed MDHES, which can accurately assess data quality and thereby provide strong data support for the development of trustworthy AI.

Introduction

With the rapid development of computer software and hardware technology, artificial intelligence (AI) has made significant breakthroughs and is increasingly applied in multiple fields of human production and life [1,2,3]. AI has proven particularly effective at predicting stock prices and stock trends in the financial field [4]. In the medical field, AI can assist doctors in diagnosing diseases and performing surgery [5]. Moreover, in automated driving, AI can identify real-time environmental information for path planning, thus enhancing the likelihood of the early arrival of unmanned vehicles [6]. However, with the widespread application of AI, persistent issues, such as the underrepresentation of data and the unfairness of model outputs, have become obstacles to deploying AI in actual scenarios [7, 8].

To address the aforementioned issues, trustworthy AI has emerged as a critically important research direction [9]. Over the past decades, numerous investigators have concentrated their efforts on optimizing model structures or enhancing learning algorithms in order to improve the credibility of AI [10, 11]. For example, Han et al. introduced fuzzy set theory to devise a new building unit, named the fuzzy denoising autoencoder (FDA), which was used to construct a fuzzy deep network [12]. The results showed that the FDA can extract more robust features than the basic unit in a traditional DNN, mitigating the effect of uncertainties. Moreover, Rozsa et al. proposed batch adjusted network gradients (BANG) for model training, leading to improvements in model accuracy [13]. However, it is undeniable that AI has been criticized in recent years for a major weakness: the lack of interpretability in its decision-making processes [14, 15]. To solve this problem, Dubey et al. devised a scalable polynomial additive model (SPAM) [16], which employs tensor rank decompositions of polynomials to create an inherently interpretable model. In addition, Meng et al. proposed a semantic-enhanced Bayesian personalized explanation ranking (SE-BPER) model, leveraging interaction information and semantic information [17]. In SE-BPER, interaction information is utilized to form a latent factor representation by constructing the interaction matrix, and semantic information is then adopted to optimize this representation, thereby improving the rationality of decision results. Additional research efforts aimed at improving the performance of neural networks can be found in [18,19,20]. It is imperative to note that the foundation of all these methods, from [12] to [20], is the trustworthiness of the data they adopt. With the maturity of AI technology, including the emergence of automated machine learning (AutoML) platforms and industry-standard frameworks like PyTorch, it is much easier than before to develop and improve models once the data are provided [21]. Nevertheless, according to surveys conducted in [22, 23], a staggering 96% of enterprises encounter challenges related to data quality or labeling in their AI projects, while 40% of them lack confidence in their ability to ensure data quality. If a model captures the biases and incorrect correlations in its data, it follows the principle of "garbage in, garbage out," with a significant impact on the credibility of AI [24]. In fact, breakthroughs in AI benefit from the development of high-quality data, which means that data quality has become absolutely crucial.

In particular, as the emphasis transitions from model optimization to data improvement, data scientists often spend nearly twice as much time on data loading, cleansing and visualization as on model training, selection and deployment [25]. Andrew Ng, a respected figure in the field of AI, has emphasized that 80% data plus 20% model can equal better machine learning [26]. This indicates that more outstanding AI outcomes can be achieved by enhancing data quality, especially when the model remains fixed. For instance, ChatGPT, the renowned chatbot, demonstrated greater capability and credibility in its third iteration compared with its predecessor. This advancement was primarily due to extensive efforts in acquiring high-quality training data, rather than substantial modifications to the model's structure [27]. Furthermore, Professor Songchun Zhu, a globally renowned expert in the AI field, once proposed: "The key to general AI lies in establishing a 'heart' for machines. Data provides learning materials and basis for intelligent agents, and is the foundation for their mental formation" [28]. Good data can guide AI towards being beneficial and trustworthy. Therefore, more scholars are participating in the evaluation and improvement of data quality. In [29], Liang et al. primarily explored the influence of the data pipeline on the credibility of AI, encompassing data design (sourcing and documentation), data sculpting (selection, cleaning and annotation), and data strategies for model testing and monitoring. Their work offers valuable theoretical insights into the significance of data quality; however, a specific implementation was not given. Additionally, a data protection strategy using backdoor watermarking was proposed to authenticate data ownership [30]. This protection method employs poison backdoor attacks for data watermarking, and a hypothesis-test-guided method is then utilized for verification. The experiments demonstrated that this method can effectively prevent data theft and thus enhance the reliability of AI. Caballero proposed a 3Cs model composed of three data quality dimensions: contextual consistency, operational consistency and temporal consistency [31]. Nevertheless, it focuses only on the consistency of data, potentially overlooking other crucial aspects; a multi-dimensional assessment could provide a more complete picture of AI credibility. For example, the significance of data integrity and consistency as pivotal quality assurance dimensions was jointly emphasized in [32]. In addition, a comprehensive assessment methodology encompassing multiple dimensions was introduced to estimate the integrity, redundancy, accuracy, timeliness, intelligence and consistency of power data [33]. The results demonstrated that this assessment method can provide a foundation for data analysis and data mining to facilitate trustworthy AI in power systems.

However, the following problems remain to be further discussed.

  1. The evaluation dimensions are currently qualitative, meaning that we understand the actions required but lack clarity on the rationale behind them and their potential scope, so practical application guidelines remain somewhat ambiguous. In data-driven scenarios, the performance of AI models is crucial and complex. Because the evaluation dimensions are qualitative, it is often difficult to accurately measure how these variables individually or synergistically affect the model, which hinders the development of effective optimization strategies for improving AI models.

  2. When data are input into an AI model, its performance is influenced not by a single dimension of the data but by multiple factors together. The intricate interactions between the various dimensions are often overlooked or simplified in current evaluation methods, making it impossible to understand and evaluate the data comprehensively and deeply. Moreover, further consideration is needed of how to harmoniously integrate subjective and objective criteria to ensure a more precise assessment of data quality.

To solve these problems, a multi-dimensional hierarchical evaluation system (MDHES) is proposed in this paper to estimate data quality. The main contributions of this paper are as follows.

  1. Multiple dimensions are designed to evaluate data conditions separately. Innovatively, not only are the evaluation dimensions given, but also their specific quantitative formulas, which provide a clearer understanding of the strengths and weaknesses among these dimensions through the calculation of individual scores. Moreover, the effects of data improvements can be explicitly measured, which offers constructive guidance and feedback for researchers implementing enhancements.

  2. A comprehensive evaluation method, incorporating a fuzzy evaluation model, is developed to synthetically evaluate data quality based on the score of each dimension. This method focuses on the interactions among dimensions to achieve a dynamic balance. Furthermore, the adoption of the fuzzy evaluation model achieves a harmonious integration of subjective and objective criteria to reflect data quality more accurately.

The outline of this paper is as follows. Section "Multi-dimensional hierarchical evaluation system" introduces the proposed MDHES, consisting of the quantization of the evaluation dimensions and the comprehensive evaluation method, in detail. Subsequently, the experimental results and comparisons on benchmark problems are discussed to demonstrate the effectiveness of the proposed MDHES in Section "Results and discussion". Section "Practical application of multi-dimension comprehensive evaluation system" highlights a real-world application for cyber-telecoms fraud identification, while the conclusions are given in Section "Conclusion".

Multi-dimensional hierarchical evaluation system

In this section, an MDHES is proposed for data quality evaluation. As shown in Fig. 1, multiple crucial dimensions, consisting of completeness, accuracy, consistency, variousness, equalization, logicality, fluctuation, uniqueness, timeliness, and standard, which together encompass the entire data pipeline (data processing, data usage, data storage and more), are meticulously designed to evaluate data conditions separately by calculating an individual score for each dimension. Then, a comprehensive evaluation method, integrating the individual scores, is developed to provide a synthetic evaluation of data quality. Next, the proposed MDHES is introduced in detail.

Fig. 1 The framework of the proposed multi-dimensional hierarchical evaluation system

Design and calculation of dimensions for data quality evaluation

1. Completeness: Data completeness refers to the absence of any gaps or missing values within a training data set. Training data containing missing values will lead AI models to yield inaccurate assumptions, primarily because incomplete data provides only partial information, potentially causing costly mistakes and a significant waste of resources [34]. Data completeness encompasses several aspects: the comprehensiveness of features, the fullness of feature values, and the adequacy of data size. For the comprehensiveness of features, θ11 can be given as

$$\theta_{11} = \min (1, \; \Omega_{11} / \mho_{11}) \times 100\% ,$$
(1)

where ℧11 is the number of features in the benchmark data, while Ω11 represents the number of features in the training data. Benchmark data are meticulously gathered by measurement organizations for specific scenarios, resulting in a comprehensive data set that meets specific requirements. However, acquiring such data demands considerable effort and financial investment.

The presence of null values affects the availability of data, undermining the decision-making capabilities of AI, especially when the training data contains too many null values. Therefore, it is imperative to consider the fullness of feature values, which can be defined as

$$\theta_{12} = (S - \Omega_{12}) / S \times 100\% ,$$
(2)

where Ω12 signifies the number of training data samples with null values, and S indicates the total number of training data samples. Having an adequate amount of training data is equally crucial, as it helps the AI model learn the underlying patterns and essential laws, thereby enhancing its generalization ability. The required sample size can be described as

$$v_{i} = \gamma^{2} p_{i} (1 - p_{i}) / \varepsilon^{2} ,$$
(3)

where vi is the target sample size of the ith category in the training data, and γ denotes the z-score associated with the confidence level ε. pi is the proportion of the ith category in the training data. The score of data size can be calculated as

$$\theta_{13} = \min \left( 1, \; \frac{1}{\Omega_{11}} \sum\limits_{i = 1}^{\Omega_{11}} s_{i} / v_{i} \right) \times 100\% ,$$
(4)

where si indicates the actual size of the ith category. Based on the above analysis, the training data completeness is

$$\theta_{1} = \omega_{11} \theta_{11} + \omega_{12} \theta_{12} + \omega_{13} \theta_{13} ,$$
(5)

where ω11, ω12 and ω13 are the combination weights of the three sub-dimensions, respectively.
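
To make the calculation concrete, the following Python sketch scores completeness according to Eqs. (1)-(5); the function name, argument names and the equal default sub-weights are illustrative assumptions rather than part of the original system.

```python
def completeness(n_feat_train, n_feat_bench, n_null_samples, n_samples,
                 category_sizes, category_props, gamma=1.96, eps=0.05,
                 w=(1/3, 1/3, 1/3)):
    """Illustrative sketch of Eqs. (1)-(5); returns a score in [0, 100]."""
    theta11 = min(1.0, n_feat_train / n_feat_bench) * 100          # Eq. (1)
    theta12 = (n_samples - n_null_samples) / n_samples * 100       # Eq. (2)
    ratios = []
    for s_i, p_i in zip(category_sizes, category_props):
        v_i = gamma**2 * p_i * (1 - p_i) / eps**2                  # Eq. (3)
        ratios.append(s_i / v_i)                                   # actual vs. target size
    theta13 = min(1.0, sum(ratios) / len(ratios)) * 100            # Eq. (4)
    return w[0] * theta11 + w[1] * theta12 + w[2] * theta13        # Eq. (5)
```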

2. Accuracy: Data accuracy is the degree of closeness between the collected data and the actual true values [35]. Perturbation data, such as false data produced by deep-fake algorithms and adversarial samples in the dataset, may hinder the AI model from recognizing correct patterns, ultimately affecting its accuracy. General perturbation data can be discovered by outlier detection technology, such as the 3-sigma criterion

$$x_{ij} \in \left[ \bar{x}_{j} - 3\kappa_{j} , \; \bar{x}_{j} + 3\kappa_{j} \right] ,$$
(6)

where xij is the value in the ith row and jth column of the training data, and x̄j and κj represent the average value and standard deviation of the jth column, respectively. The outlier score θ21 can be given as

$$\theta_{21} = (S - \Omega_{21}) / S \times 100\% ,$$
(7)

where Ω21 is the number of outliers.

For carefully designed perturbation data (e.g., false data produced by deep-fake algorithms or contaminated data), regarded as adversarial samples, the robustness of the data can be evaluated by the Adversarial Category Average Confidence (ACAC) [36], defined as the average confidence of the model classifier when misclassifying

$$\theta_{22} = \begin{cases} \dfrac{1}{\Omega_{22}} \sum\limits_{i = 1}^{\Omega_{22}} P\left( F(x_{i}^{adv}) \ne y_{i} \right) , & \text{non-targeted attack} \\ \dfrac{1}{\Omega_{22}} \sum\limits_{i = 1}^{\Omega_{22}} P\left( F(x_{i}^{adv}) = y_{i} \right) , & \text{targeted attack} \end{cases}$$
(8)

where Ω22 represents the number of successful adversarial sample attacks, x_i^adv is the ith adversarial sample, and yi is the corresponding label value. Therefore, the accuracy of the training data θ2 can be given as

$$\theta_{2} = (S - \Omega_{2}) / S \times 100\% ,$$
(9)

where Ω2 is the total number of anomalous samples detected, including both outliers and successfully attacking adversarial samples.
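
As an illustration, the outlier part of the accuracy score (Eqs. (6) and (9)) can be sketched in a few lines of Python; treating a sample as anomalous when any of its feature values leaves the 3-sigma band is a simplifying assumption of this sketch.

```python
import numpy as np

def sigma3_accuracy(X):
    """Share of samples whose features all lie within 3 sigma (Eqs. (6), (9))."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    inside = np.abs(X - mean) <= 3 * std           # Eq. (6), element-wise
    n_anomalous = int((~inside.all(axis=1)).sum()) # samples breaking the band
    return (len(X) - n_anomalous) / len(X) * 100   # Eq. (9)

# Example: one injected extreme value slightly lowers the score
X = np.random.randn(1000, 5)
X[0, 0] = 50.0
print(round(sigma3_accuracy(X), 2))
```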

3. Consistency: Contradictory samples in training data can confuse AI models, preventing them from learning the correct relationships within the data set. This confusion can reduce the prediction stability of models across various scenarios and even over time [37]. Therefore, it is crucial to measure the consistency of training data. First, the format of features in the training data is verified

$$\theta_{31} = (S - \Omega_{31}) / S \times 100\% ,$$
(10)

where θ31 is the score for format consistency and Ω31 indicates the number of samples whose features appear in inconsistent formats. Moreover, by comparing the feature values and labels between samples, the content consistency for the same matter can be given as

$$\theta_{32} = (S - \Omega_{32}) / S \times 100\% ,$$
(11)

where Ω32 denotes the number of samples containing conflicting content. In addition, after the content consistency evaluation, adversarial samples with targeted attacks will also have been detected. Therefore, the consistency of the training data θ3 is

$$\theta_{3} = \omega_{31} \theta_{31} + \omega_{32} \theta_{32} ,$$
(12)

where ω31 and ω32 are the combination weights.
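
A minimal pandas sketch of the content-consistency check in Eq. (11) follows; identifying conflicts as identical feature rows carrying different labels, and the column name "label", are illustrative assumptions.

```python
import pandas as pd

def content_consistency(df, label_col="label"):
    """Sketch of Eq. (11): penalize identical samples with conflicting labels."""
    feature_cols = [c for c in df.columns if c != label_col]
    # number of distinct labels observed per identical feature combination
    n_labels = df.groupby(feature_cols)[label_col].transform("nunique")
    n_conflicting = int((n_labels > 1).sum())        # Omega_32
    return (len(df) - n_conflicting) / len(df) * 100
```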

4. Variousness: When the training data lacks diversity and is predominantly homogeneous, the AI model may become overfitted, limiting its ability to generalize [38]. In fact, the variousness of data can be enhanced by federated learning; however, too many participants may cause computational waste and increase training costs. Therefore, to ensure the versatility of the model across different scenarios, it is essential to evaluate the breadth of data sources θ41 and the richness of training data categories θ42

$$\theta_{41} = \min (1, \; \Omega_{41} / \mho_{41}) \times 100\% , \qquad \theta_{42} = \min (1, \; \Omega_{42} / \mho_{42}) \times 100\% ,$$
(13)

where Ω41 is the number of data sources of the training data and ℧41 is the number of data sources of the benchmark data. Additionally, Ω42 denotes the number of categories in the training data, while ℧42 is the number of categories in the benchmark data.

Therefore, the variousness of training data can be defined as

$$\theta_{4} = \omega_{41} \theta_{41} + \omega_{42} \theta_{42} ,$$
(14)

where ω41 and ω42 signify the combination weights.

5. Equalization: A significant discrepancy in the number of samples across different categories can result in discriminatory decision outcomes from AI models [39]. Thus, to prevent such biases, it is essential to evaluate and ensure the equalization of the training data

$$\theta_{5} = \left[ 1 - (\zeta_{\max} - \bar{\zeta}) / (\zeta_{\max} - \zeta_{\min}) \right] \times 100\% ,$$
(15)

where ζmax and ζmin are the maximum and minimum numbers of samples per category, respectively, and ζ̄ is the average number of samples per category.
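
A short sketch of Eq. (15), under the assumption that ζ refers to per-category sample counts:

```python
import numpy as np

def equalization(category_sizes):
    """Sketch of Eq. (15): balance of sample counts across categories."""
    z = np.asarray(category_sizes, dtype=float)
    if z.max() == z.min():   # perfectly balanced data set
        return 100.0
    return (1 - (z.max() - z.mean()) / (z.max() - z.min())) * 100
```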

6. Logicality: Logicality can be utilized to evaluate whether the relationships between features in training data align with factual or commonsense knowledge, since certain data errors may remain undetected during the data accuracy assessment [40]. Logical relationships between features encompass comparisons such as 'greater than', 'less than' and 'equal to'. For example, suppose the product of feature A and feature B is expected to be greater than or equal to feature C; if their product falls short of feature C, a logical inconsistency is indicated. Based on prior knowledge, logicality can be given as

$$\theta_{6} = (S - \Omega_{61}) / S \times 100\% ,$$
(16)

where Ω61 represents the number of samples with logical errors.
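
Logic rules are scenario-specific; the sketch below encodes the 'A times B must be at least C' example from the text as one member of a hypothetical rule set over a pandas DataFrame (all column names are illustrative).

```python
import pandas as pd

# Hypothetical prior-knowledge rules; each returns a boolean violation mask.
RULES = [
    lambda df: df["A"] * df["B"] < df["C"],   # product of A and B must be >= C
    lambda df: df["duration"] < 0,            # durations cannot be negative
]

def logicality(df):
    """Sketch of Eq. (16): share of samples violating no logic rule."""
    violated = pd.Series(False, index=df.index)
    for rule in RULES:
        violated |= rule(df)
    n_errors = int(violated.sum())            # Omega_61
    return (len(df) - n_errors) / len(df) * 100
```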

7. Fluctuation: To investigate the periodic variation or distribution pattern of training data, fluctuation evaluation can be used to assess the difference between historical samples and the latest samples [41]. The evaluation formula quantifying these fluctuations is expressed as

$$\theta_{7} = \min \left( 1, \; \left| \mho_{71} - \Omega_{71} \right| / \mho_{71} \right) \times 100\% ,$$
(17)

where ℧71 is the sum of the historical samples, Ω71 signifies the sum of the latest samples, and |·| denotes the absolute value operation. It is worth noting that, to ensure computational fairness, the number of historical samples should equal the number of latest samples.

8. Uniqueness: Repetitive samples influence the outputs of AI models, often causing overfitting [42]. Nevertheless, simply eliminating these samples from the training data could compromise the generalization ability of the models. Therefore, a data uniqueness assessment is required as a basis for deletion that does not affect model performance

$$\theta_{8} = (S - \Omega_{81}) / S \times 100\% ,$$
(18)

where Ω81 is the number of repetitive samples in training data.
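
Duplicate counting for Eq. (18) reduces to a single pandas call; counting every repeated occurrence beyond the first as part of Ω81 is one reading of 'repetitive samples'.

```python
import pandas as pd

def uniqueness(df):
    """Sketch of Eq. (18): share of non-duplicated samples."""
    n_duplicates = int(df.duplicated().sum())   # Omega_81: repeated rows
    return (len(df) - n_duplicates) / len(df) * 100
```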

9. Timeliness: AI models need to be trained with the latest data to ensure optimal performance [43]. Failure to update the data in a timely manner may deprive the model of access to the latest information, ultimately diminishing the accuracy of its predictions or decisions. Therefore, the timeliness of data is critical to maintaining the effectiveness of an AI model. Timeliness includes the distributional shift of the data and the update frequency of the data. The distributional shift of the data is given as

$$\theta_{9,1} = \min (1, \; 1 - \Theta) \times 100\% ,$$
(19)

where

$$\Theta = \frac{1}{100n} \sqrt{ \sum\limits_{i = 1}^{n} \left[ \hat{S}_{i}(t) - \hat{S}_{i}(t - \tau) \right]^{2} } ,$$
(20)

where Ŝi(t) is the average value of the ith feature of the dataset at time t, n is the number of features, and τ indicates the time interval. Furthermore, the update frequency of the data is

$$\theta_{9,2} = (T_{B} - T_{N}) / T_{B} \times 100\% ,$$
(21)

where TB represents the ideal number of update periods and TN is the number of periods during which the data has not been updated. Thus, the timeliness of the data can be described as

$$\theta_{9} = \omega_{9,1} \theta_{9,1} + \omega_{9,2} \theta_{9,2} ,$$
(22)

where ω9,1 and ω9,2 are the combination weights.
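
The distributional-shift part of timeliness (Eqs. (19)-(20)) can be sketched as follows; reading n as the number of features and comparing the feature means of two snapshots are assumptions based on the definitions above.

```python
import numpy as np

def distribution_shift_score(X_now, X_prev):
    """Sketch of Eqs. (19)-(20): score in [0, 100]; higher means less drift."""
    n = X_now.shape[1]                                  # number of features
    diff = X_now.mean(axis=0) - X_prev.mean(axis=0)     # S_i(t) - S_i(t - tau)
    theta = np.sqrt(np.sum(diff ** 2)) / (100 * n)      # Eq. (20)
    return min(1.0, 1.0 - theta) * 100                  # Eq. (19)
```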

10. Standard: By standardizing data formats and naming conventions, the readability and comprehensibility of the data are significantly improved. This helps researchers rapidly familiarize themselves with the data, reducing the complexity and time required for data processing [44]. Furthermore, data reliability can be enhanced by standardizing the data sources, collection processes, data storage and data training. A standard data source means that the data source channel is legitimate. Standard collection processes and data storage mean that the data are protected by privacy-preserving techniques such as differential privacy and federated learning. Standard data training refers to following standards in the process of using online or offline data to train AI models, such as homogeneous or heterogeneous computing. Based on manual judgement, the standard score θ10 can be given as

$$\theta_{10} = \sum\limits_{k = 1}^{6} \omega_{10,k} \, \theta_{10,k} ,$$
(23)

where θ10,1 is the format standard score, θ10,2 is the naming standard score, and θ10,3, θ10,4, θ10,5 and θ10,6 indicate the scores for the data source channel, collection process, data storage channel and data training, respectively; ω10 = [ω10,1, ω10,2, …, ω10,6] represents the weight set.

Remark

Ten key dimensions have been elaborately designed to evaluate data quality. By calculating the score of each dimension, researchers gain a clearer understanding of the data condition. This approach not only offers valuable guidance but also provides feedback on the effectiveness of training data enhancement. Moreover, the use of simple naming conventions for dimensions, such as serial numbering, simplifies the expansion of dimensions or the addition of sub-items within each dimension in future studies. It is worth noting that, to better align with real-world scenarios, certain dimensions incorporate subjective evaluation criteria. In the following discussion, we explore strategies to minimize the influence of subjectivity.

Comprehensive evaluation for data quality

Based on the dimensions discussed above, data quality is comprehensively evaluated in this section. Evaluating multi-dimensional problems is difficult, owing to the potential fuzziness and subjectivity inherent in certain dimension values. Moreover, all dimensions have to be considered simultaneously, coupled with their varying degrees of importance, which makes the problem more complicated. The fuzzy comprehensive evaluation model offers a solution: based on fuzzy mathematics, it transforms qualitative evaluation into quantitative evaluation to make an overall judgment of objects constrained by multiple dimensions. Therefore, this model is introduced into the proposed hierarchical evaluation system to solve difficult-to-quantify and non-deterministic problems, better achieving a harmonious integration of subjective and objective criteria. Next, the fuzzy comprehensive evaluation model is described in detail.

Determination of evaluation dimensions and evaluation grades

According to the discussion in Section "Design and calculation of dimensions for data quality evaluation", the dimensions completeness, accuracy, consistency, variousness, equalization, logicality, fluctuation, uniqueness, timeliness and standard can be applied for data quality evaluation. However, training data in different application scenarios should be evaluated using the appropriate dimensions. Thus, the evaluation dimension set can be defined as θ = {θ1, θ2, θ3, …, θn}, with n ∈ [1, 10]. The evaluation grades can be given as v = {v1, v2, v3, …, vm}, where m is the number of evaluation grades. For example, when m is four, the set v is {excellent, good, pass, poor}.

Construction of evaluation matrix and weight vector

Firstly, the evaluation matrix R, which collects the membership degree of each dimension in set θ for each evaluation grade in set v, needs to be determined

$$\mathbf{R} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{pmatrix} , \quad \text{with } \sum\limits_{j = 1}^{m} r_{ij} = 1 ,$$
(24)

where rij is the membership degree of the ith dimension for the jth evaluation grade, obtained by applying the membership function. It is important to highlight that distinct membership functions should be employed in different scenarios. After the evaluation matrix is calculated, the weight vector also needs to be determined.

Each dimension plays a different role and occupies a different position in the evaluation of data quality. Therefore, the weight vector a = (a1, a2, …, an), with ai ≥ 0 and ∑ai = 1, needs to be designed, which expresses the importance of each dimension relative to the problem to be evaluated, since the requirements for each dimension differ across scenarios. The analytic hierarchy process (AHP) [45] is suitable for determining the weights in this paper; it is an effective analytical method that harmoniously combines qualitative and quantitative approaches for solving complex problems with multiple objectives. To save space, the working principle of AHP is not presented here, while the computational steps are given in detail in Section "Results and discussion".

Make comprehensive evaluation

For the evaluation matrix R and weight vector a, the fuzzy transformation result b can be described as

$$\mathbf{b} = \mathbf{a} \circ \mathbf{R} ,$$
(25)

where the vector b = [b1, b2, …, bm] and the symbol ∘ represents the fuzzy composition operator. Moreover, in order to make full use of the information in R, the weighted average operator can be adopted

$$b_{j} = \min \left\{ 1, \; \sum\limits_{i = 1}^{n} \min (a_{i} , r_{ij}) \right\} .$$
(26)

Thus, the comprehensive evaluation result CER is

$$\mathrm{CER} = \sum\limits_{j = 1}^{m} \eta_{j} b_{j} ,$$
(27)

where ηj is the score of the jth evaluation grade. Moreover, the detailed evaluation processes are summarized in Table 1.
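
The composition in Eqs. (25)-(27) takes only a few lines of NumPy; the sketch below implements the bounded operator of Eq. (26) literally, and all names are illustrative.

```python
import numpy as np

def comprehensive_score(a, R, eta):
    """Sketch of Eqs. (25)-(27): weight vector a (n,), membership matrix
    R (n, m) and grade scores eta (m,); returns the CER."""
    a, R, eta = np.asarray(a), np.asarray(R), np.asarray(eta)
    b = np.minimum(1.0, np.minimum(a[:, None], R).sum(axis=0))  # Eq. (26)
    return float(b @ eta)                                       # Eq. (27)
```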

Table 1 The evaluation process of the proposed MDHES for data quality

Remark

Data quality evaluation is complex because many dimensions must be considered together, each with a different level of importance. Therefore, the proposed comprehensive evaluation method, incorporating the fuzzy comprehensive evaluation model, can effectively focus on the interactions of dimensions to achieve a dynamic balance. Furthermore, expert knowledge can be employed in the evaluation while the influence of subjectivity on the whole evaluation process is mitigated by leveraging fuzzy mathematics. Thus, this comprehensive evaluation method can accurately reflect data quality and increase the credibility of AI.

Results and discussion

To further demonstrate the procedures and effectiveness of the proposed MDHES, an example on a publicly available data set, the KDDCup99 intrusion detection dataset, is discussed in this section. The simulations are run on the Windows 10.0 operating system with a 2.6 GHz clock speed and 4 GB RAM, in the Anaconda 3 development environment with the Python 3.6 programming language.

Intrusion detection scenario

Intrusion detection discovers violations of security policies or signs of attacks in networks and systems by collecting and analyzing operational log information. The KDDCup99 dataset was derived from a simulated US Air Force LAN subjected to seven weeks of attacks and can be obtained from https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. The 10_percent_ subset is adopted in this example. Each sample in the subset is labeled as normal or as one of four abnormal categories. Furthermore, anomalous data and another 17 attack types appear in the test data.

Determination of benchmark data

To match actual scenarios, benchmark data are randomly sampled from the training and testing sets at a ratio of 80%. Additionally, 50% of the data is randomly sampled from the benchmark data as validation samples to validate the effectiveness of the AI model (a fuzzy neural network, FNN). The distribution details are summarized in Table 2.

Table 2 The distribution details of training data, testing data, benchmark data and validation data

Determination of evaluation dimensions

For the training data in the intrusion detection scenario, the evaluation dimensions completeness, accuracy, consistency, variousness, equalization, logicality and uniqueness are analyzed and used. Completeness, variousness and equalization essentially guarantee the adequacy and richness of the intrusion detection data, thereby increasing the generalization and stability of the AI model. Accuracy, consistency and logicality ensure that the intrusion data learned by AI models during training are real and effective, thereby improving the accuracy of the model in identifying intrusion behaviors. The fluctuation, timeliness and standard dimensions are not evaluated in this experiment, for the following reasons:

Fluctuation: The intrusion detection data do not follow periodic changes, so the fluctuation assessment cannot be performed.

Timeliness: Although this training data dates from 1999, it provides a foundation for research on network intrusion detection based on computational intelligence. According to the requirements of timeliness, however, the evaluation score would be very low. Therefore, the timeliness evaluation is omitted in this experiment to avoid distorting the final evaluation results.

Standard: The standard of the training data cannot be assessed, because the data acquisition process and storage channel are unknown.

Evaluation results and analysis for intrusion detection

Dimension evaluation for training data: Based on completeness, accuracy, consistency, variousness, equalization, logicality and uniqueness, the KDDCup99 intrusion detection data were evaluated with equal weights for each evaluation dimension; the details are summarized in Table 3. For the completeness of the training data, the scores of features and feature values are obviously 100. The score of data size is 76.71 at a confidence level of 4%, so the completeness is 92.24. In addition, the accuracy is 96, where 18,612 anomalous data points are filtered out according to Eq. (6). No contradiction was discovered between the training data and the benchmark data by comparing each sample, so the consistency score is 100. Because the benchmark data are randomly selected from the training and testing sets, the multi-source score of the training data is 50. Moreover, there are 23 categories in the training data and 39 categories in the benchmark data, giving a diversity score of 56; therefore, the variousness score is 53. Intuitively, there is a significant difference in the number of samples across the categories of the training data, and the equalization score is only 1 according to Eq. (15). The logicality score is determined to be 100 by calibration. Additionally, there are 348,435 duplicate entries in the training data, and the uniqueness score is calculated to be 30.

Table 3 The Evaluation results of each dimension for training data of intrusion detection

Comprehensive evaluation for training data: Based on the above discussion of each dimension, data quality can now be comprehensively evaluated. In our experiment, data quality is divided into four evaluation grades: poor, pass, good and excellent. The evaluation matrix R, containing the membership degree of each dimension for each evaluation grade, needs to be calculated. First, the membership functions are determined. For the 'poor' grade, the lower the score of a dimension, the greater its degree of membership; thus, 'poor' should be a gradually decreasing function. Similarly, 'excellent' is a gradually rising function, while both 'pass' and 'good' are functions that rise and then fall. Based on this analysis, the trapezoidal membership function is adopted for its computational simplicity. The distribution can be given as

$$r_{i1} = \begin{cases} 1, & x \le \delta_{1} \\ (\delta_{2} - x)/(\delta_{2} - \delta_{1}), & \delta_{1} < x < \delta_{2} \\ 0, & x \ge \delta_{2} \end{cases} \qquad r_{i2} = \begin{cases} (x - \delta_{1})/(\delta_{2} - \delta_{1}), & \delta_{1} < x < \delta_{2} \\ (\delta_{3} - x)/(\delta_{3} - \delta_{2}), & \delta_{2} \le x < \delta_{3} \\ 0, & x \le \delta_{1} \text{ or } x \ge \delta_{3} \end{cases}$$

$$r_{i3} = \begin{cases} (x - \delta_{2})/(\delta_{3} - \delta_{2}), & \delta_{2} < x < \delta_{3} \\ (\delta_{4} - x)/(\delta_{4} - \delta_{3}), & \delta_{3} \le x < \delta_{4} \\ 0, & x \le \delta_{2} \text{ or } x \ge \delta_{4} \end{cases} \qquad r_{i4} = \begin{cases} 0, & x \le \delta_{3} \\ (x - \delta_{3})/(\delta_{4} - \delta_{3}), & \delta_{3} < x \le \delta_{4} \\ 1, & x \ge \delta_{4} \end{cases}$$
(28)

where δ1, δ2, δ3 and δ4 are the critical values of the four levels poor, pass, good and excellent, respectively.
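
The trapezoidal functions in Eq. (28) can be written compactly as combinations of two ramps; the helper below is an illustrative sketch which, for the completeness score 92.24 with the thresholds used in Eq. (29), reproduces the fuzzy set [0, 0.276, 0.724, 0] reported below.

```python
def trapezoid_memberships(x, d1, d2, d3, d4):
    """Sketch of Eq. (28): memberships of x in (poor, pass, good, excellent)."""
    def down(lo, hi):   # decreasing ramp on [lo, hi]
        return 1.0 if x <= lo else 0.0 if x >= hi else (hi - x) / (hi - lo)
    def up(lo, hi):     # increasing ramp on [lo, hi]
        return 0.0 if x <= lo else 1.0 if x >= hi else (x - lo) / (hi - lo)
    return [down(d1, d2),                    # poor
            min(up(d1, d2), down(d2, d3)),   # pass
            min(up(d2, d3), down(d3, d4)),   # good
            up(d3, d4)]                      # excellent

print([round(r, 3) for r in trapezoid_memberships(92.24, 80, 85, 95, 98)])
# -> [0.0, 0.276, 0.724, 0.0]
```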

For the intrusion detection scenario, completeness, accuracy, consistency and logicality are critical, as they have a direct impact on the FNN model. However, equalization is hard to guarantee, owing to the difficulty and cost of attacks. Uniqueness is also difficult to score well, because attack means are repeated in this scenario. The resulting grade distribution of each dimension is given in Table 4.

According to Table 4, the membership functions of the dimensions can be determined. For example, the membership functions of completeness for the four grades are

$$r_{i1} = \begin{cases} 1, & x \le 80 \\ (85 - x)/(85 - 80), & 80 < x < 85 \\ 0, & x \ge 85 \end{cases} \qquad r_{i2} = \begin{cases} (x - 80)/(85 - 80), & 80 < x < 85 \\ (95 - x)/(95 - 85), & 85 \le x < 95 \\ 0, & x \le 80 \text{ or } x \ge 95 \end{cases}$$

$$r_{i3} = \begin{cases} (x - 85)/(95 - 85), & 85 \le x < 95 \\ (98 - x)/(98 - 95), & 95 \le x < 98 \\ 0, & x \le 85 \text{ or } x \ge 98 \end{cases} \qquad r_{i4} = \begin{cases} 0, & x \le 95 \\ (x - 95)/(98 - 95), & 95 < x \le 98 \\ 1, & x \ge 98 \end{cases}$$
(29)

Thus, the fuzzy set of completeness is [0, 0.276, 0.724, 0]. Sequentially, the evaluation matrix R can be given as

$$\mathbf{R} = \begin{bmatrix} 0 & 0.276 & 0.724 & 0 \\ 0 & 0 & 0.67 & 0.33 \\ 0 & 0 & 0 & 1 \\ 0.55 & 0.45 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0.5 & 0.5 & 0 & 0 \end{bmatrix} ,$$
(30)
Table 4 The grade distribution of each dimension in intrusion detection scenario

Next, the weight vector a is derived by adopting AHP. In AHP, the judgment matrix C is first constructed, and the elements cij of C are assigned using the 1–9 scale method proposed by Saaty.

Based on Table 5, the judgment matrix C, obtained by pairwise comparison of completeness, accuracy, consistency, variousness, equalization, logicality and uniqueness, can be defined as

$$\mathbf{C} = \begin{bmatrix} 1 & 1/4 & 1/6 & 1/2 & 7 & 1/4 & 6 \\ 4 & 1 & 1/3 & 5 & 9 & 1 & 9 \\ 6 & 3 & 1 & 5 & 9 & 1 & 9 \\ 2 & 1/5 & 1/5 & 1 & 5 & 1/6 & 5 \\ 1/7 & 1/9 & 1/9 & 1/5 & 1 & 1/8 & 1/2 \\ 4 & 1 & 1 & 7 & 8 & 1 & 8 \\ 1/6 & 1/9 & 1/9 & 1/5 & 2 & 1/9 & 1 \end{bmatrix} ,$$
(31)
$$CR = CI / RI , \qquad CI = (\lambda_{\max} - \tau) / (\tau - 1) ,$$
(32)

where CR is the consistency ratio, λmax is the largest eigenvalue of the matrix C, and τ is the order of the matrix. For matrix C, λmax is 7.588 and τ is 7. Additionally, for a matrix of order 7, the corresponding random index RI is 1.32, as given directly by Saaty. Therefore, the value of CR is 0.074, which passes the consistency test. The eigenvector corresponding to λmax is [– 0.15, – 0.43, – 0.69, – 0.17, – 0.04, – 0.53, – 0.048]. After normalization, the weight vector is [0.18, 0.25, 0.25, 0.04, 0.04, 0.12, 0.12]. Therefore, the fuzzy transformation result is [0.1, 0.1, 0.47, 0.31]. The scores corresponding to the four evaluation grades are [40, 60, 80, 90]. Hence, the comprehensive evaluation result for the intrusion detection training data is 75.5.
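
The AHP steps above amount to a principal-eigenvector computation; the sketch below is an illustration and, applied to the matrix C in Eq. (31) with RI = 1.32, should recover λmax ≈ 7.588, CR ≈ 0.074 and the normalized weights reported here.

```python
import numpy as np

def ahp_weights(C, RI):
    """Sketch of Eq. (32): AHP weights via the principal eigenvector."""
    C = np.asarray(C, dtype=float)
    tau = C.shape[0]                          # order of the judgment matrix
    eigvals, eigvecs = np.linalg.eig(C)
    k = int(np.argmax(eigvals.real))          # index of principal eigenvalue
    lam_max = eigvals.real[k]
    CI = (lam_max - tau) / (tau - 1)
    CR = CI / RI                              # consistency ratio, Eq. (32)
    w = np.abs(eigvecs[:, k].real)
    return w / w.sum(), lam_max, CR           # weights, lambda_max, CR
```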

Table 5 1–9 scale method

In addition, to highlight the advantages of the fuzzy comprehensive evaluation method, it is compared with the weighted average evaluation method, in which each dimension is equally weighted. The data quality score calculated in this way is 71.2, about 4 points lower than the score of the fuzzy comprehensive evaluation. The fuzzy comprehensive evaluation method produces the final result by setting unequal weights for the evaluation dimensions; this weight setting reflects the relative importance of the dimensions, making the results more scientific and reasonable. Besides, the method can effectively handle vague and uncertain information. These advantages make the fuzzy comprehensive evaluation method prominent and one of the effective means of solving practical evaluation problems.

Based on the above discussion, equalization and uniqueness are the main causes of the unsatisfactory data quality score. Therefore, some data processing is required to improve the data quality. After de-duplication and simple random sampling [46], the distributions of all data sets are shown in Figs. 2 and 3. It can be seen that, to improve the equalization, the number of samples in category R2L is decreased, while the other categories are increased. Moreover, the evaluation results of the improved training data are summarized in Table 6, and the comparison with the raw training data is displayed in Fig. 4.

Fig. 2 The number of each data set

Fig. 3 The number of categories of each data set

Fig. 4 The evaluation result comparison of raw training data and improved training data for intrusion detection

Based on Fig. 4 and Table 6, the accuracy is improved by 2% compared with the raw training data. Additionally, the scores of variousness, equalization and uniqueness are improved by at least 50%. It is worth noting that duplicates are not completely removed, in order to maintain a balance between equalization and uniqueness given the limitations of the benchmark database. The comprehensive score of the processed training data is 89.2; its quality has improved significantly.

Table 6 The evaluation results of each dimension for processed training data of intrusion detection

To further demonstrate the impact of the training data improvement on model performance, the FNN is applied to the validation data. The simulation results of the FNN on the processed training data (FNN-P) are compared with the FNN on the raw training data (FNN-R), a multi-level hybrid support vector machine and extreme learning machine based on modified k-means (SVM + ELM + MK) [47], a back propagation (BP) network, an improved long short-term memory (ILSTM) [48], and an improved sparse denoising autoencoder (ISSDA) [49]. To make a fair comparison, the accuracy TA, false positive rate TF and false negative rate TM are introduced. The results are displayed in Figs. 5, 6 and 7, and the details are shown in Table 7.

Fig. 5 The accuracy comparison of different AI models

Table 7 Performance comparison of different AI models for intrusion detection

The accuracy of the different AI models is displayed in Fig. 5, and the false positive and false negative rates are shown in Figs. 6 and 7. Combined with Table 7, it can be seen that FNN-P achieves a significant improvement in all categories (its performance is shown in bold in Table 7); for example, the accuracy for 'Normal' is improved by approximately 8% and the false rate for 'Dos' is reduced to 3.6% compared with FNN-R. This indicates that data quality plays a crucial role in improving model performance. Moreover, FNN-P performs better than the meticulously designed SVM + ELM + MK, especially in the categories 'U2R' and 'R2L'. In addition, the effectiveness of FNN-P is comparable to that of the improved deep networks (ILSTM and ISSDA). Based on the above analysis, the proposed MDHES is of considerable significance. The strengths and weaknesses of the intrusion detection data can be understood more clearly by quantifying the score of each dimension, which provides guidance and feedback on data quality improvement for researchers. Furthermore, the proposed comprehensive evaluation method can achieve a dynamic balance among dimensions, avoiding the waste of resources caused by excessively pursuing the improvement of a single dimension; even though the equalization score is only 58, the FNN still achieves good performance.

Fig. 6 The false positive rate comparison of different AI models

Fig. 7 The false negative rate comparison of different AI models

Practical application of multi-dimension comprehensive evaluation system

In recent years, cyber-telecoms fraud has become a serious social problem that threatens people's property safety, characterized by frequent occurrence, rapid growth, and persistence despite repeated prohibition. Deep neural networks (DNNs) are a powerful tool for effectively distinguishing fraudulent behaviour. However, this tool is highly questionable when its effects cannot be determined, which may cause unbearable consequences. Therefore, the proposed MDHES is applied from the data quality perspective to the intelligent identification of cyber-telecoms fraud, to enhance the credibility of DNNs.

Fraud data from four provincial companies in China were obtained to form a benchmark database after data processing (deduplication, denoising, completion of missing values, etc.); the database is encrypted by means of differential privacy and uploaded to the data management library of the proposed evaluation system. The data features can be summarized into three categories: network management data, signaling data, and call ticket data. The network management data contain the IP address, MAC address, port number and so on. Signaling data are location-related information. The features of call ticket data include the number of calls, total call duration, maximum call duration, minimum call duration and last call duration. After selection, the benchmark database contains 32 important features, some of which are highly sensitive and must not be disclosed publicly. Moreover, the total number of records is 500 million, containing approximately 400 million normal samples and 100 million fraud samples. There are two main types of fraud: traditional fraud (gambling, pornography, pyramid schemes) and new fraud (investment and financial management, pig-butchering scams, loans, order brushing, impersonation of industry and commerce authorities, impersonation of public security, procuratorial and judicial authorities, impersonation of customer service, and so on).

In this scenario, nine dimensions are adopted to evaluate the training data, excluding fluctuation owing to the irregular occurrence of fraud activities. First, the training data file with 40 million samples is uploaded to the proposed MDHES (see Fig. 8). Then, by clicking the 'Create' button, an evaluation task is created. Next, the nine dimensions are selected, and the task is started by clicking the 'Start' button. The evaluation results are shown in Fig. 9. It can be seen that the data size score in completeness is 58.16 and the variousness is 75. Consequently, the dimensions completeness, consistency, uniqueness and timeliness have unsatisfactory scores, which need to be further improved before the evaluation process is repeated. Additionally, to clearly demonstrate the changes in the scores of the raw and improved data samples, the comparisons are displayed in Fig. 10 and the details are summarized in Table 8.

Fig. 8 The evaluation task creation of fraud data

Fig. 9 The evaluation result display of fraud data

Fig. 10 The evaluation result comparison of raw training data and improved training data for cyber-telecoms fraud identification

As shown in Fig. 10 and Table 8, the completeness of the training data is improved by 10%. Consistency and timeliness are two especially important dimensions for fraud-identification DNNs, owing to the turnover of fraudulent objects; for example, a previously normal URL may become a malicious URL, which leads to contradictory label values. Therefore, the consistency is enhanced by correcting contradictory samples based on the latest fraud data, and the timeliness is increased accordingly. Moreover, the comprehensive score for data quality is 92.6, approximately 8.3% higher than that of the original training data. To reveal the impact of the data quality change on DNNs, an SDA and an LSTM are adopted, where the number of basic building units of the SDA is 32 and that of the LSTM is 21. The performances on the validation data are shown in Fig. 11.

Table 8 The evaluation comparisons of the raw and improved data samples for telephone network fraud
Fig. 11 The average performance comparison of DNNs with the raw training data and improved training data

As shown in Fig. 11, the average accuracy, average false positive rate and average false negative rate of the DNNs all improve. The average accuracy of the DNNs is 89.46%, an increase of 13% compared with the original data. Furthermore, the average TF is reduced to 8.19% and the average TM to 2.61%. Based on the evaluation results and performance, this intelligent identification model meets the actual application requirements and has been deployed in the business of cyber-telecoms fraud prevention. The actual application is shown in Fig. 12.

Fig. 12 Real-time display screen of cyber-telecoms fraud prevention

The screen shows the total number of fraudulent websites today, the total number of fraudulent websites within a week, the number of interceptions, and the total number of interceptions within a week. Moreover, the top twelve types of fraudulent websites are given, and the most frequently blocked websites are also displayed. In the first quarter of 2024, the cyber-telecoms fraud prevention system effectively identified more than 3 million fraudulent websites and blocked more than 46 billion illegal visits, effectively reducing the incidence of fraud and protecting citizens' lives and property.

Conclusion

In this paper, an MDHES is proposed to estimate data quality. In this evaluation system, multiple crucial dimensions are designed to evaluate data conditions separately, which provides a clearer understanding for improvement. Then, a comprehensive evaluation method, incorporating a fuzzy evaluation model, is developed to synthetically evaluate data quality, achieving a dynamic balance while mitigating the impact of subjectivity on the comprehensive evaluation result. Finally, experimental results and comparisons, on an intrusion detection benchmark problem and a real intelligent application identifying cyber-telecoms fraud, demonstrate the effectiveness of the proposed MDHES, which achieves an accurate and thorough data quality assessment to provide strong data support for trustworthy AI. In addition, when there are many dimensions (more than nine), the workload of AHP scaling becomes very large, which can easily cause confusion in judgment. Therefore, in future research, on the one hand, we will focus on improving the comprehensive evaluation method to better assess data quality; on the other hand, more dimensions and sub-items for large models will be considered to improve our evaluation system, promoting the application of AI in more practical scenarios in a safe and efficient manner.

Data availability

The KDDCup99 dataset was derived from a simulated US Air Force LAN subjected to seven weeks of attacks and can be obtained from https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. The fraud dataset is owned by a third party. The data underlying this paper were provided by [third party] under licence/by permission. The data will be shared on request to the corresponding author with permission of [third party].

Abbreviations

AI:

Artificial intelligence

MDHES:

Multi-dimensional hierarchical evaluation system

FDA:

Fuzzy denoising autoencoder

BANG:

Batch adjusted network gradients

SPAM:

Scalable polynomial additive model

SE-BPER:

Semantic-enhanced Bayesian personalized explanation ranking

AutoML:

Automated machine learning

FNN:

Fuzzy neural network

AHP:

Analytic hierarchy process

FNN-P:

FNN trained on the processed training data

FNN-R:

FNN trained on the raw training data

SVM + ELM + MK:

Support vector machine and extreme learning machine based on modified k-means

BP:

Back propagation

ILSTM:

Improved long-short term memory

ISSDA:

Improved sparse denoising autoencoder

References

  1. Ahmed I, Jeon G, Piccialli F. From artificial intelligence to explainable artificial intelligence in industry 4.0: a survey on what, how, and where. IEEE Trans Ind Inform. 2022;18(8):5031–42.


  2. Putra MA, Ahmad T, Hostiadi DP. B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows. J Big Data. 2024;11(1):49.


  3. Zhang HJ, He S, Chen J. A hierarchical authentication system for access equipment in internet of things. Int J Intell Syst. 2023;1:1–11.


  4. Furman J, Seamans R. AI and the economy. Innov Policy Econ. 2019;19(1):1–191.


  5. Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2:719–31.


  6. Khan A. Role of artificial intelligence in car-following and lane change models for autonomous driving. Adv Hum Asp Transp. 2018;9:307–17.


  7. Schetinin V, Li D, Maple C. An evolutionary-based approach to learning multiple decision models from underrepresented data. In: Schetinin V, editor. 2008 Fourth international conference on natural computation, vol. 1. Jinan: IEEE; 2008.


  8. Vavra P, Baar JV, Sanfey A. The neural basis of fairness. Interdiscip Perspect Fairness Equity Justice. 2017;5:9–31.


  9. Liu HC, Wang YQ, Fan WQ, et al. Trustworthy AI: a computational perspective. ACM Trans Intell Syst Technol. 2022;14(1):1–59.


  10. Chatila R, Dignum V, Fisher M, et al. Trustworthy AI. Reflect Artif Intell Hum. 2021;12600:13–39.


  11. Malchiodi D, Raimondi D, Fumagalli G, et al. The role of classifiers and data complexity in learned Bloom filters: insights and recommendations. J Big Data. 2024;11(45):1–26.

  12. Han HG, Zhang HJ, Qiao JF. Robust deep neural network using fuzzy denoising autoencoder. Int J Fuzzy Syst. 2020;22(6):1356–75.

  13. Rozsa A, Gunther M, Boult TE. Towards robust deep neural networks with BANG. In: IEEE winter conference on applications of computer vision (WACV). 2018.

  14. Yampolskiy R. Unexplainability and Incomprehensibility of AI. J Artif Intell Conscious. 2020;7(2):1–15.

  15. Guidotti R, Monreale A, Ruggieri S, et al. A survey of methods for explaining black box models. ACM Comput Surv. 2019;51(5):1–42.

  16. Dubey A, Radenovic F, Mahajan D. Scalable interpretability via polynomials. Neural Inform Process Syst. 2022;1:1–26.

  17. Meng ZL, Wang MH, Bai JJ, et al. Interpreting deep learning-based networking systems. IEEE Commun Surv Tutor. 2019;21(3):2702–33.

  18. Nwafor O, Okafor E, Aboushady AA, et al. Explainable artificial intelligence for prediction of non-technical losses in electricity distribution networks. IEEE Access. 2023;11:73104–15.

  19. McClure P, Moraczewski D, Lam KC, et al. Improving the interpretability of fMRI decoding using deep neural networks and adversarial robustness. Apert Neuro. 2023;3:1–17.

  20. Fernandes FE, Yen GG. Automatic searching and pruning of deep neural networks for medical imaging diagnostic. IEEE Trans Neural Netw Learn Syst. 2021;32(12):5664–74.

  21. Barreiro E, Munteanu CR, Monteagudo MC, et al. Net–net auto machine learning (AutoML) prediction of complex ecosystems. Sci Rep. 2018;8(12340):2685–96.

  22. Elliott A. What data scientists tell us about AI model training today. Alegion. 2019;1–10.

  23. Forrester Consulting. Overcome obstacles to get to AI at scale. IBM. 2020; 1–12.

  24. Kortylewski A. Analyzing and reducing the damage of dataset bias to face recognition with synthetic data. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019; pp. 2261–2268.

  25. Jackson A. The state of open data science 2020. Digital Science. 2020;1–30.

  26. Ng A. A chat with Andrew on MLOps: from model-centric to data-centric AI. 2022. https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.

  27. Zhang D, Lai H. Data-centric artificial intelligence: a survey. 2023;1–39.

  28. Zhu SC. Making mathematical models for the humanities: Chinese thought from the perspective of artificial general intelligence. J Mod Stud. 2024;3(1):42–66.

  29. Liang WX, Tadesse GA, Ho D, et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat Mach Intell. 2022;4:669–77.

  30. Artamonov I, Deniskina A, Filatov V, et al. Quality management assurance using data integrity model. Matec Web Conf. 2019. https://doi.org/10.1051/matecconf/201926507031.

  31. Caballero I, Serrano M, Piattini M. A data quality in use model for big data. Adv Concept Model. 2014;8823:65.

  32. Cai L, Zhu YY. The challenges of data quality and data quality assessment in the big data era. Data Sci J. 2015;14(2):78–92.

  33. Hongxun T, Honggang W, Kun Z. Data quality assessment for on-line monitoring and measuring system of power quality based on big data and data provenance theory. In: Hongxun T, editor. 2018 IEEE 3rd international conference on cloud computing and big data analysis. Chengdu: IEEE; 2018. p. 248–52.

  34. Cai L, Zhu YY. The challenges of data quality and data quality assessment in the big data era. Data Sci J. 2015;14:69–87.

  35. Barchard KA, Verenikina Y. Improving data accuracy: selecting the best data checking technique. Comput Hum Behav. 2013;29(5):1917–22.

  36. Li ZT, Sun JB, Yang KW, Xiong DH. A review of adversarial robustness evaluation for image classification. J Comput Res Dev. 2022;59(10):2164–89.

  37. Khalfi B, de Runz C, Faiz S, Akdag H. A new methodology for storing consistent fuzzy geospatial data in big data environment. IEEE Trans Big Data. 2021;7(2):468–82.

  38. Wang S, Yao X. Relationships between diversity of classification ensembles and single-class performance measures. IEEE Trans Knowl Data Eng. 2013;25(1):206–19.

  39. Chae JH, Jeong YU, Kim S. Data-dependent selection of amplitude and phase equalization in a quarter-rate transmitter for memory interfaces. IEEE Trans Circuits Syst. 2020;67(9):2972–83.

  40. Yao W. Research on static software defect prediction algorithm based on big data technology. In: Yao W, editor. 2020 International conference on virtual reality and intelligent systems (ICVRIS). Zhangjiajie: IEEE; 2020. p. 610–3.

  41. Kim KY, Park BG. Effect of random dopant fluctuation on data retention time distribution in DRAM. IEEE Trans Electron Devices. 2021;68(11):5572–7.

  42. Widad E, Saida E, Gahi Y. Quality anomaly detection using predictive techniques: an extensive big data quality framework for reliable data analysis. IEEE Access. 2023;11:103306–18.

  43. Xia Q, Xu Z, Liang W, Yu S, et al. Efficient data placement and replication for QoS-aware approximate query evaluation of big data analytics. IEEE Trans Parallel Distrib Syst. 2019;30(12):2677–91.

  44. Lee D. Big data quality assurance through data traceability: a case study of the national standard reference data program of Korea. IEEE Access. 2019;7:36294–9.

  45. Ge Z, Liu Y. Analytic hierarchy process based fuzzy decision fusion system for model prioritization and process monitoring application. IEEE Trans Industr Inf. 2019;15(1):357–65.

  46. Antal E, Tillé Y. Simple random sampling with over-replacement. J Stat Plann Inference. 2011;141(1):597–601.

  47. Al-Yaseen WL, Othman ZA, Nazri MZA. Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system. Expert Syst Appl. 2017;67:296–303.

  48. Zhang L, Yan H, Zhu Q. An improved LSTM network intrusion detection method. In: Zhang L, editor. 2020 IEEE 6th international conference on computer and communications (ICCC). Chengdu: IEEE; 2020.

  49. Guo XD, Li XM, Jing RX, et al. Intrusion detection based on improved sparse denoising autoencoder. J Comput Appl. 2019;39(3):769–73.

Acknowledgements

Zhang Yusheng is acknowledged for the consulting assistance and quiet companionship he provided to the first author of this manuscript.

Funding

This work was supported by the project "Research and Standardization of Key Technologies for 6G General Computing and Intelligent Integration" (Grants R241149BC03 and R241149B) and the project "Research on 6G Trusted Endogenous Security Architecture and Key Technologies" (Grant R24113V7).

Author information

Contributions

H.J. is the main designer of the proposed evaluation system and a major contributor to writing the manuscript. C.C., P.R., and K.Y. coordinated with other parties to obtain and analyze the fraud data. Z.Y., J.C., Q.C., and J.K. are mainly responsible for the web development of this multi-dimensional comprehensive evaluation system and the real-time display screen for cyber-telecoms fraud prevention. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hui-Juan Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, HJ., Chen, CC., Ran, P. et al. A multi-dimensional hierarchical evaluation system for data quality in trustworthy AI. J Big Data 11, 136 (2024). https://doi.org/10.1186/s40537-024-00999-2
