Ramifications of incorrect image segmentations; emphasizing on the potential effects on deep learning methods failure

Detecting failure cases is critical to ensuring a safe self-driving system, since a flaw in the system can directly result in an accident. The probability a model assigns to the true class is a better reflection of model confidence. Under this criterion, the confidence distributions of failed predictions shift to lower values, while accurate predictions remain associated with high values, allowing considerably better separability between the two prediction types. The study investigates the ramifications associated with computational color constancy that can negatively influence CNN image classification and semantic segmentation. Image datasets were used to conduct experiments at different scales and levels of complexity. For instance, small and relatively simple images of digits were provided by the MNIST and SVHN datasets. Because ground truth is not publicly available for some test sets, the dataset's standard validation set was used for testing and for computing additional metrics. The results showed that the proposed approach outperformed baseline methods in every context, with a considerable margin on small datasets or models. Therefore, True Class Probability (TCP) is an appropriate criterion for failure prediction, and ConfidNet is capable of learning it as a confidence criterion. Further, one solution would be to increase the validation set size, but this would affect the prediction performance of the model. In contrast, the confidence estimation was based on models with levels of test predictive performance similar to those of the baselines. The gap between validation accuracy and training accuracy was significant on CIFAR-100, which explains the modest improvement in failure detection obtained via the validation set.

in these domains. Most high-risk applications already have legacy procedures that can perform the task, such as human professionals making a classification [2]. A critical element in maintaining trust in a model's performance is producing prediction confidence estimates that accurately reflect the expected accuracy on each sample [3]. This would allow a practitioner not only to better understand the chance of the model predicting incorrectly on a per-sample basis, but also to use that estimate to decide when to default to the legacy procedure. There are two core uses for estimates of prediction confidence [4]. Some applications require the confidence estimate directly as a model output, which is then used in the next phase of the decision-making procedure [5]. In such applications, the confidence estimate must represent the expected sample accuracy and preserve the natural interpretation of a probability [6].
Deep neural networks have gained wide acceptance, driven by their strong performance in tasks such as object recognition, natural language processing, speech recognition, and image classification [7][8][9][10][11]. Despite their growing success, safety remains a significant issue when integrating such models into real-world settings [11]. Estimating when a model is in error becomes more crucial in applications where failure has severe repercussions, including nuclear power plant monitoring, medical diagnosis, and autonomous driving [12]. In this regard, failure prediction has been addressed with deep neural networks [13][14][15]. From a classification viewpoint, a widely used baseline takes the probability of the predicted class, known as the Maximum Class Probability (MCP), as the confidence value. Although recent assessments report strong performance with deep models, MCP still suffers from several conceptual limitations for failure prediction [16].
Indeed, softmax probabilities are known to be poorly calibrated, ill-suited to detecting out-of-distribution examples, and sensitive to adversarial attacks [17]. Another critical concern with MCP relates to the ranking of confidence scores, which is unreliable for the failure prediction task [18]. The issue arises because MCP, by design, yields high confidence values even for erroneous predictions. However, the probability that the model assigns to the true class is a better reflection of model confidence. With this quantity, the confidence distributions of failures shift to lower values, whereas accurate predictions remain associated with high values, which allows much better separability between the two prediction types. Therefore, this paper presents a failure prediction model with deep neural networks by introducing a new confidence criterion based on the True Class Probability (TCP), which offers theoretical guarantees for failure prediction.
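The contrast between MCP and the true-class criterion can be illustrated with a minimal Python sketch. The function names `mcp` and `tcp` and the two example distributions are illustrative assumptions, not part of the original method's code; the true class is assumed to be known, which holds only at training time.

```python
def mcp(probs):
    """Maximum Class Probability: the largest predicted class probability."""
    return max(probs)


def tcp(probs, true_class):
    """True Class Probability: the probability assigned to the true class."""
    return probs[true_class]


# Two illustrative binary predictions whose true class is 0 in both cases.
correct_pred = [0.60, 0.40]  # argmax is class 0, so the prediction is correct
wrong_pred = [0.40, 0.60]    # argmax is class 1, so the prediction is a failure

mcp_correct, mcp_wrong = mcp(correct_pred), mcp(wrong_pred)
tcp_correct, tcp_wrong = tcp(correct_pred, 0), tcp(wrong_pred, 0)
```

In this toy case MCP gives the same score to the correct and the failed prediction, whereas the true-class probability separates them.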

Learning model confidence for failure prediction
Deep neural networks were used to define appropriate confidence criteria for predicting failures, specifically in the case of classification. Semantic image segmentation was also considered; it can be viewed as a pixel-wise classification problem in which the model outputs a dense segmentation mask with a predicted class assigned to each pixel. All of the following material is developed for classification, and adaptation details for segmentation are provided where needed.
Here, $x_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i^* \in \mathcal{Y} = \{1, \ldots, K\}$ is its true class. A classification neural network is viewed as a probabilistic model: given an input $x$ and the network parameters $w$, the network assigns a probabilistic predictive distribution $P(Y = k \mid w, x)$ to each class $k$. The model predicts the class as $\hat{y} = \operatorname{argmax}_{k \in \mathcal{Y}} P(Y = k \mid w, x)$.
Network parameters are obtained during training by maximum likelihood estimation, in which one minimizes the Kullback-Leibler (KL) divergence between the true and the predictive distribution. In classification, this is equivalent to minimizing, with respect to $w$, the cross-entropy loss, which is the negative sum of the log-probabilities of the true labels.
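In the notation used above, the cross-entropy loss described here can be written as follows. The number of training samples $N$, the dataset symbol $\mathcal{D}$, and the $1/N$ normalization are assumptions introduced for this sketch; dropping the normalization gives the plain negative sum and does not change the minimizer.

```latex
\mathcal{L}_{\mathrm{CE}}(w; \mathcal{D})
  = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(Y = y_i^{*} \mid w, x_i\right)
```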

Using confidence estimates
Well-calibrated confidence estimates become increasingly important as deep learning models are integrated into real-world decision-making systems where the cost of misclassification is high. A confidence estimate is well-calibrated if it is sufficiently close to the probability that the input is correctly classified. The probability of correct classification can be estimated as the sample average accuracy over all data points with similar attributes. Where few data points share the same characteristics, similar inputs can be grouped. Particular applications use the confidence estimate from a discriminative model as an input to the next phase of the decision-making process.
Calibration can be achieved by learning a mapping from prediction scores to well-calibrated probabilities. T-scaling, short for temperature scaling, is a specific case of Platt calibration in which the logit scores of a classifier are divided by a scalar T. In a thorough examination of calibration methods, [12] found T-scaling to be both the simplest and the most successful calibration approach. Because T-scaling does not change the rank order of predictions, it affects the Brier score and the expected calibration error but not the label error. Calibration parameters are fitted on the validation set, which is assumed to follow the same distribution as the training set. Calibration does not directly address unfamiliar samples, but our studies indicate that calibration is critical for providing appropriate confidence estimates on both known and unfamiliar data.
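Temperature scaling can be sketched in a few lines of Python. This is a minimal illustration, assuming hand-picked logits and a fixed temperature of 2.0; in practice T would be fitted on a validation set, which is not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature T."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [4.0, 1.0, 0.5]                      # hypothetical classifier logits
p_raw = softmax(logits)                       # T = 1: unscaled probabilities
p_scaled = softmax(logits, temperature=2.0)   # T > 1 softens the distribution
```

Dividing by T > 1 lowers the maximum probability while leaving the predicted class unchanged, which is why the label error is unaffected.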
Jiang et al. [19] describe ICU mortality calculators whose confidence estimates are used to decide whether to continue various types of therapy. Because the next step is often decided on that basis, the estimate must carry the intuitive meaning that people would expect. Instead of comparing the confidence estimate of the predicted class to a threshold, the full estimated probability distribution across all possible classes can be used as an input to another model. An interpretable probability estimate is needed when a human expert relies on that value. Alternatively, confidence estimates can be compared with a threshold to decide whether to trust a model's predictions. This is readily applicable to automated medical diagnosis, since the model can defer to a professional for inputs that it cannot classify with adequate confidence [19]. Because the legacy process amounts to receiving a diagnosis from a doctor, whose prediction can be treated as that of an expert, and because inaccurate diagnoses can carry a high cost, the model should be used only when the user can trust the accuracy of its prediction. In this setting, the confidence estimate does not need to be interpretable as a standalone quantity; it only needs to be a good predictor of whether the model's prediction can be trusted.

Calibration of modern neural networks
For classification problems, a natural probability distribution is obtained by applying a softmax layer to the neural network's output. However, recent work has indicated that modern neural networks are poorly calibrated despite their higher generalization accuracy [17]. That work studied several recent changes to neural network design and training and associated this pattern with increases in model capacity and with a form of overfitting. After a certain point in training, the negative log-likelihood (NLL) was observed to increase, indicating that the model overfits the NLL loss even though test accuracy does not degrade [17]. This is possible with the NLL loss in particular, because the loss can still be reduced after the training points are correctly classified, by further sharpening the predicted probability distribution over output classes. As a result, the probability estimates of modern neural networks can be overconfident. These findings are consistent with [20], which indicates that deep neural networks run counter to the conventionally held idea that large models generalize poorly, irrespective of regularization. Guo et al. [17] suggest that the overfitting observed during training does not show up in the generalization error but rather in the accuracy of the confidence estimates. Previous studies have explored confidence calibration for neural networks, but they require an ensemble of models, which makes calibration expensive [21].
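Overconfidence of this kind is commonly quantified with the expected calibration error. The following Python sketch assumes equal-width confidence bins and a hypothetical function name `expected_calibration_error`; the binning convention (half-open bins, with a confidence of 1.0 placed in the last bin) is a choice made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin i covers [i / n_bins, (i + 1) / n_bins); 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 95% confidence but only half the answers right.
ece_overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# A calibrated toy model: 50% confidence and half the answers right.
ece_calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A model can therefore keep the same accuracy and still differ widely in calibration error, depending only on how sharp its predicted probabilities are.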

Image improvement techniques
Different techniques can be adopted to improve image quality, such as adjusting contrast and brightness, dodging and burning (adjusting the brightness in an area), color balance, and cropping [22]. These are considered traditional techniques. The contrast, colors, and brightness depend on the scene's characteristics, the settings of the devices, and the quality of the components [23]. The non-traditional image enhancement techniques are linear filtering, non-linear contrast adjustments, random-noise reduction, pattern noise reduction filters, and color processing [24]. Linear filtering techniques, such as sharpening, deblurring, edge enhancement, and deconvolution, are used to increase the contrast of small details in an image [25]. Deconvolution is an algorithmic correction technique that reconstructs missing elements on a statistical basis and removes disturbing factors, making it possible to create a higher-quality image. Non-linear contrast adjustment techniques include gamma correction, grayscale transformations, curves, and lookup tables. These techniques are used to adjust the contrast in selected brightness ranges of an image [26]. Random-noise reduction techniques include low-pass filters, blurring filters, median filters, and despeckling. Pattern noise reduction filters, by contrast, identify patterns that repeat in the image and allow users to remove them selectively. Color processing includes color space transformations, pseudo-coloring (also called color level coding), hue adjustment, and saturation adjustment [27]. These techniques can change the characteristics of objects in an image.
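As an illustration of a non-linear contrast adjustment, gamma correction can be implemented with a lookup table. The sketch below assumes 8-bit grayscale values stored in nested Python lists and the convention output = input^gamma on normalized intensities, so that gamma below 1 brightens; the function names are hypothetical.

```python
def gamma_lut(gamma, levels=256):
    """Build a lookup table mapping each gray level through a power law."""
    top = levels - 1
    return [round(top * (v / top) ** gamma) for v in range(levels)]


def apply_lut(image, lut):
    """Apply a lookup table to every pixel of a 2-D grayscale image."""
    return [[lut[pixel] for pixel in row] for row in image]


image = [[0, 64], [128, 255]]                  # tiny illustrative image
brightened = apply_lut(image, gamma_lut(0.5))  # gamma < 1 lifts mid-tones
```

Because the table is computed once and reused for every pixel, the per-pixel cost is a single index operation, and the end points 0 and 255 are left unchanged.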
Other approaches have addressed the tendency of MCP to produce high-confidence predictions in tasks closely related to failure prediction [17,21]. Previously, temperature scaling was used to mitigate overconfident values for out-of-distribution detection and confidence calibration. However, it does not change the ranking of confidence scores and therefore does not improve the separability between correct predictions and errors. The same objective of learning confidence in neural networks was pursued by [4]. That work differs in that it focuses on out-of-distribution detection and jointly learns classification probabilities and a distribution confidence score.
Additionally, it uses the predicted confidence score to interpolate between output probabilities and the target, whereas here TCP is defined as a criterion specifically suited to failure prediction. An alternative to Bayesian neural networks was proposed by [21], which uses neural networks to produce well-calibrated uncertainty measures. A proper scoring rule was used there as the training criterion; TCP corresponds to the exponential of a model prediction's negative cross-entropy loss value.
Many tools are used to enhance images, and they fall into two groups: point techniques and spatial techniques. Point techniques include contrast stretching, noise clipping, modification, and pseudo-coloring [28]. They are used in most image processing work and in many operations. Spatial techniques are also used in image processing. The operations used in these techniques are linear operations, which are the most widely used today [29]. The main reason is that linear operations are simple and straightforward, and their implementation is less complex than that of the non-linear operations used in the point technique [30]. Non-linear methods are used primarily to handle image edges and recover fine details, whereas linear techniques mainly blur and distort them. However, non-linear methods cannot remove all noise from images, because noise is always present owing to its randomness [31]. For instance, film recordings with sound often captured noise that then had to be removed. Noise is also commonly introduced during digitization, when the image signals are generated [32].
Digital images produce large amounts of data to be stored. Image compression techniques therefore reduce memory requirements by limiting the data to be recorded. Lossless compression reduces file size by eliminating redundant information only [33], so the content of an image is not altered when it is decompressed. Lossy compression achieves a greater reduction in file size by removing both redundant and irrelevant information. Since the removed information cannot be reconstructed when the image is viewed, this type of compression causes an unavoidable loss of image content and introduces artifacts [34]. The higher the compression rate, the greater the loss of information.
The objective of reconstruction is to eliminate a form of interference present in the image, called noise, understood as the superposition of unwanted signals on the signal of interest [35]. In the presence of noise, the image typically has a grainy appearance. It may also contain real "gaps", as in the case of salt-and-pepper noise, in which a randomly distributed percentage of the information in the image is completely lost [36]. Typically, noise is caused by problems in signal transmission (as in the case of medical images) or by poor lighting in the scene. The purpose of denoising is to remove the interference in the signal, producing a clean version of the image in which all fundamental structures are preserved and the noise is eliminated [37].
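Salt-and-pepper noise of this kind is commonly treated with a median filter, one of the random-noise reduction techniques listed earlier. The following Python sketch assumes a grayscale image stored as nested lists and a 3x3 window that is simply truncated at the borders; both choices are made for illustration only.

```python
from statistics import median


def median_filter3(image):
    """Replace each pixel with the median of its (truncated) 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(height):
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            # Even-sized border windows yield an averaged median; cast to int.
            out[r][c] = int(median(window))
    return out


# A flat gray patch with one "salt" pixel in the center.
noisy = [[100, 100, 100], [100, 255, 100], [100, 100, 100]]
denoised = median_filter3(noisy)
```

Because the median ignores isolated extreme values, the corrupted pixel is replaced by the surrounding gray level while the uniform region stays unchanged.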
The human body is a complex system, and acquiring data on its static and dynamic properties produces large amounts of information. One of the biggest challenges is acquiring, processing, and displaying data about the body so that the information can be viewed, interpreted, and used for analysis in diagnostic procedures and to assist in therapies [38]. In many cases, presenting information about the human body as images is the most efficient way to address this challenge. Medical images are produced by the interaction of some kind of energy with the human body's tissues, organs, or systems [39]. Producing a medical image always involves the interaction of a specific form of energy (electromagnetic or mechanical) with matter. The image is visualized using a contrast parameter, determined by some physical characteristic that differentiates the tissues, organs, or systems [40]. Except for ultrasound, which uses mechanical energy, most imaging modalities rely on the interaction of electromagnetic energy with the human body.

Study gap
Therefore, this paper presents a failure prediction model with deep neural networks. A new confidence criterion based on TCP was introduced, which offers theoretical guarantees for failure prediction. Because the true class is unknown at test time, a new method was introduced for learning this target confidence criterion from data. Connections to and differences from related work on failure prediction, including Bayesian deep learning and ensemble approaches, were also discussed.
The study is significant because it proposes a specific method for learning failure prediction with deep neural networks, using a confidence neural network built on top of a classification model. The experimental results show substantial improvement over strong baselines on different classification and semantic segmentation datasets, validating the efficacy of the proposed approach. Figure 1 shows the approach presented in this section, which was assessed for predicting failure in image classification and segmentation. First, comparative experiments were performed against Bayesian uncertainty estimation and state-of-the-art confidence estimation methods on different datasets. These findings were then complemented by a comprehensive investigation of the impact of the confidence criterion, the learning scheme, and the training loss in this approach. Lastly, a few visualizations were provided to give further insight into the behavior of the approach.

Al-Dmour Journal of Big Data (2022) 9:71

Experimental data

Data sets
Image datasets were used to conduct experiments at different scales and levels of complexity. For instance, small and relatively simple images of digits were provided by the MNIST and SVHN datasets [41,42]. Similarly, object recognition on low-resolution images was covered by CIFAR-10 and CIFAR-100 [43]. Moreover, CamVid [44], a standard road scene dataset, was used for the semantic segmentation experiments. Because ground truth is not publicly available for some test sets, the study used the dataset's standard validation set for testing in those cases so that additional metrics could be computed.

Network architectures
For a fair comparison, this study followed the classification deep architectures presented by [45]. They range from small convolutional networks for SVHN and MNIST to the larger VGG-16 architecture for the CIFAR datasets. The study also investigated several design architectures of the MLP neural network, which lead to results of different quality. In particular, a multi-layer perceptron (MLP) with one hidden layer was added for MNIST to investigate the performance of small models. The proposed design structure is likely to be extensible to different hardware specifications and accuracy constraints. For CamVid, a SegNet semantic segmentation model was applied, following [46]. ConfidNet, the confidence prediction network used in this study, was attached to the penultimate layer of the classification network. It consists of a succession of five dense layers. Variants of this architecture were investigated and led to similar performance. Following the learning scheme, the ConfidNet layers were trained first, before fine-tuning the duplicate ConvNet encoder dedicated to confidence estimation. ConfidNet was adapted to semantic segmentation by making it fully convolutional.

Assessment parameters
Failure prediction was evaluated using the standard metrics proposed in [17]: AUROC, FPR at 95% TPR, AUPR-Error, and AUPR-Success. The main emphasis is on AUPR-Error, which computes the area under the Precision-Recall curve with errors as the positive class.
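These metrics can be sketched in plain Python. The sketch assumes one confidence score per sample and a boolean flag marking whether the prediction was correct; the function names are hypothetical, AUPR is approximated by average precision, and ties in confidence are not treated specially in `aupr_error`.

```python
import math


def auroc(conf, correct):
    """Probability that a correct prediction outranks an erroneous one."""
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_95_tpr(conf, correct):
    """Fraction of errors accepted when 95% of correct predictions are kept."""
    pos = sorted((c for c, ok in zip(conf, correct) if ok), reverse=True)
    neg = [c for c, ok in zip(conf, correct) if not ok]
    threshold = pos[math.ceil(0.95 * len(pos)) - 1]
    return sum(n >= threshold for n in neg) / len(neg)


def aupr_error(conf, correct):
    """Average precision with errors as positives, ranked by low confidence."""
    order = sorted(range(len(conf)), key=lambda i: conf[i])
    n_errors = sum(1 for ok in correct if not ok)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if not correct[i]:
            hits += 1
            total += hits / rank
    return total / n_errors


# Illustrative confidences: four correct predictions and two errors.
conf = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
correct = [True, True, True, True, False, False]
```

AUPR-Success follows the same pattern with correct predictions as the positive class and samples ranked by high confidence.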

Comparative findings on failure prediction
Competitive confidence and uncertainty estimation approaches were included to demonstrate the method's effectiveness: Monte-Carlo Dropout (MCDropout) [10], Maximum Class Probability (MCP) [17], and Trust Score [20]. Table 1 summarizes the comparative results. First, the proposed approach outperformed the baseline methods in every context, with a considerable margin on small datasets or models. This shows that TCP is an adequate and appropriate criterion for failure prediction and that ConfidNet is capable of learning this confidence criterion.
The TrustScore method also performed well on small datasets or models such as MNIST, where it improved on the baseline. In contrast, the effectiveness of ConfidNet was seen mainly on larger and more complex datasets, where the performance of TrustScore declines because of the high-dimensionality issues that affect distances. For semantic segmentation, where each training pixel is a neighbor, computational complexity forced a drastic reduction in the number of training neighbors and test samples.
To compute TrustScore, a small percentage of pixels was randomly sampled from each training and test image. In contrast, ConfidNet showed efficacy in its robustness, speed, and output. State-of-the-art performance was further improved relative to confidence measures based on dropout layers. Two samples with similar distribution entropy are shown side by side. The left image presents a misclassified sample, whereas the right one shows an accurate prediction. The correct prediction has a [0.60, 0.40] distribution, while the incorrect one has a [0.40, 0.60] distribution. This illustrates that an incorrect prediction can be differentiated from an accurate one even when the two have similar distributions. Figure 3 portrays the performance of ConfidNet and other metrics for the SVHN and CIFAR10 datasets as depicted in risk-coverage curves [8,11]. A threshold was used as a selection function, with coverage corresponding to the probability mass of the non-rejected region. From the performance on both datasets, a better coverage was presented by

Effect of learning variants
The impact of fine-tuning the ConvNet was assessed first. As presented in Table 2, significant improvements over the baseline were obtained even without fine-tuning. Allowing fine-tuning further improved the performance of ConfidNet in every context, by almost 2%. Fine-tuning without deactivating the dropout layers brought no significant improvement. Given the small number of errors available in the training set because of deep neural network over-fitting, training ConfidNet on a hold-out dataset was also tested. Table 3 presents findings for all datasets based on validation sets containing 15% of the samples. A reduction in overall performance was observed when a validation set was used for training the TCP confidence. The decline was particularly pronounced for small datasets, where models reach training and validation accuracies of ≥ 95%. With a high accuracy and a small validation set, the validation set does not provide a larger absolute number of errors than the training set. One solution would be to increase the validation set size, but this would affect the prediction performance of the model. In contrast, the confidence estimation here was based on models with levels of test predictive performance similar to those of the baselines. The gap between validation accuracy and training accuracy was significant on CIFAR-100, which explains the modest improvement in failure detection obtained via the validation set. It was observed that the approach could be improved by training ConfidNet on the validation set for models with low or middling test accuracies.
Training ConfidNet with the MSE loss was then compared with training using a binary classification cross-entropy (BCE) loss. Lower performance was obtained with BCE on the CIFAR-10 and CamVid datasets, although BCE addresses the failure prediction task directly. Similar outcomes were observed with the focal loss and the ranking loss. Intuitively, TCP regularizes training by providing more fine-grained information about the quality of the classifier's prediction for a sample. This is particularly important in the difficult learning configuration where very few error samples are available because the classifier performs well. The effect of regressing to the normalized criterion TCP^r was also assessed. Performance with TCP^r was lower than with TCP on small datasets, including CIFAR-10, which underlines the difficulty of training on correct versus incorrect classification.
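The difference between the two training objectives can be sketched as follows. This is a minimal illustration and not ConfidNet's actual training code: the classifier probabilities, the confidence values in `pred_conf`, and the function names are all assumed for the example.

```python
import math


def mse_to_tcp(pred_conf, tcp_targets):
    """MSE between predicted confidence and the true class probability."""
    pairs = zip(pred_conf, tcp_targets)
    return sum((c - t) ** 2 for c, t in pairs) / len(pred_conf)


def bce_on_correctness(pred_conf, is_correct, eps=1e-7):
    """Binary cross-entropy against a 0/1 correct-or-incorrect label."""
    total = 0.0
    for c, ok in zip(pred_conf, is_correct):
        c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(c) if ok else -math.log(1.0 - c)
    return total / len(pred_conf)


# Hypothetical classifier outputs and true classes for three samples.
probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.1, 0.8]]
true_classes = [0, 0, 2]

# Fine-grained regression targets: the probability given to the true class.
tcp_targets = [p[y] for p, y in zip(probs, true_classes)]
# Coarse binary targets: whether the argmax matches the true class.
is_correct = [
    max(range(len(p)), key=p.__getitem__) == y
    for p, y in zip(probs, true_classes)
]

pred_conf = [0.65, 0.45, 0.75]  # assumed outputs of a confidence network
mse = mse_to_tcp(pred_conf, tcp_targets)
bce = bce_on_correctness(pred_conf, is_correct)
```

The MSE targets carry a graded value for every sample, including the misclassified one, whereas the BCE targets reduce each sample to a single bit.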

Qualitative assessments
Visualizations on CamVid are provided to give a better understanding of the approach for failure prediction. The approach produced higher confidence scores for accurate pixel predictions and lower ones for mispredicted pixels, allowing the user to detect errors effectively in semantic segmentation (Fig. 4).

Discussion
According to the experimental results, the level set approach can obtain an accurate segmentation result when adequate information is available. When the scene in the photo is more complicated, however, the level set approach cannot produce the required segmentation result. Specific pattern recognition algorithms are therefore developed to provide additional information about the target, for example, the target's regions and the likelihood that each pixel belongs to the target category. Throughout the contour evolution process, the level set approach depends on statistics of the pixels inside and outside the contour, such as the mean, the weighted mean, and a probability model of the regions [47]. The level set approach is more akin to an information integration method in that it employs the energy functional minimization principle to generate a potential function. Furthermore, this potential function can be treated as a probability function, determined using probabilities and Bayesian approaches. Even when pixels are relatively similar, an inaccurate probability map and the adjusted prior shape have a significant influence and can cause comparable pixels to be separated. However, the concept of superpixels may be used to address this [48]. The majority of image segmentation approaches may be summarized as extracting and using information from pictures. As a result, the most significant challenge is to create a dynamic, hierarchically organized picture representation.

Fig. 4 Inverse confidence patterns
The explicit representation of spatial changes and picture noise in our mathematical formulation of test-time augmentation is based on an image acquisition model. It may, however, be readily adapted to accommodate more generic transformations such as elastic deformations [49] or to include a simulated bias field. In addition to the range of possible model parameter values, the prediction result is also affected by the input data, for example through picture noise and object transformations. A proper uncertainty assessment should therefore take these elements into account. For regression problems, where the outputs are not discretized category labels, the variance of the output distribution may be more appropriate than entropy for estimating uncertainty.
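The two uncertainty measures mentioned here can be sketched in Python. The example assumes that the model has already been run on several augmented copies of one input; the probability vectors and regression outputs below are made-up values, and the function name is hypothetical.

```python
import math
from statistics import pvariance


def predictive_entropy(prob_samples):
    """Entropy of the mean class distribution over augmented predictions."""
    n = len(prob_samples)
    k = len(prob_samples[0])
    mean = [sum(p[j] for p in prob_samples) / n for j in range(k)]
    return -sum(m * math.log(m) for m in mean if m > 0)


# Classification: class probabilities from three augmented copies of an input.
tta_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
entropy = predictive_entropy(tta_probs)

# Regression: scalar outputs from three augmented copies of an input.
tta_outputs = [2.0, 2.5, 3.0]
variance = pvariance(tta_outputs)
```

Entropy is bounded by the logarithm of the number of classes, whereas the variance is expressed in the squared units of the regression target, which is one reason it is the more natural choice for continuous outputs.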

Conclusions
A new confidence criterion was proposed for failure prediction with deep neural networks, offering both empirical evidence and theoretical guarantees. A specific method for learning this criterion was presented, in which a confidence neural network, ConfidNet, is built on top of a classification model. Findings indicated a substantial improvement over strong baselines on different classification and semantic segmentation datasets, validating the efficacy of the proposed approach. ConfidNet could also be applied to estimating uncertainties in multi-task learning and domain adaptation. The majority of image segmentation approaches may be summarized as extracting and using information from pictures. As a result, the most significant challenge is to create a dynamic, hierarchically organized picture representation. Furthermore, building a multi-objective matching approach would allow the proposed system to handle more complicated situations.
Additional work is required to refine the proposed approach and to deploy the prototype in a real-world brain tumor segmentation setting. To begin with, only the grey level is used as the deep network's input in this research; in the future, other features, such as texture features, may be used as input. Furthermore, additional brain tumor MRI data must be collected on an ongoing basis. More data will benefit both the suggested technique and other tumor classification systems.