A predictive noise correction methodology for manufacturing process datasets

In manufacturing processes, datasets intended for data driven decisions are mostly generated from time-sequenced sensor readings. Industrial sensor systems are prone to transmitting inaccurate readings, which result in noisy datasets. Noisy datasets inhibit machine learning and knowledge discovery. Using a multi-stage, multi-output process dataset as an experimental case, this article reports a methodology for replacing erroneous sensor values with their predicted likely values. In the methodology, invalid values specified by process owners are first converted to missing values. Then, the ReliefF algorithm is used to select the most relevant features to progress for prediction modelling, and also to boost the performance of the prediction model. A Random Forest classifier model is built to predict replacement values for the missing values. Finally, predicted values are inserted into the dataset to fill in the missing entries. With many attributes having a significant number of erroneous values, the replacement of invalid values is done one attribute at a time. To do this systematically, the process flow direction and stages in the manufacturing process are exploited to partition the dataset into subsets for model building. The results indicate that the methodology is able to replace erroneous values with likely true values, to a very high degree of accuracy. There is a paucity of this type of methodology for dealing with invalid entries in process datasets. The methodology is useful for both missing and invalid value correction in process datasets. In the future, the plan is to inject the prediction models into streaming data to simultaneously enable erroneous value correction and predictive process monitoring in real-time.

Noisy datasets are difficult to interpret by those not familiar with the system. Where noisy values are significant in the dataset, they inhibit machine learning performance and can also lead to false predictions [2]. It may be infeasible or expensive to change sensors or the system, and so decision makers are faced with no other option than to rely on noisy datasets.
This research explores the use of classification prediction modelling to replace erroneous values in time series datasets of multi-stage manufacturing processes. Given that such datasets may have hundreds to thousands of sensor-generated parameters, feature set reduction through feature relevance ranking is considered for boosting the performance of the prediction model. A dataset [3] that was generated using a Manufacturing Execution System was used for the research. The 116-feature, 14,088-entry dataset included time series information for: ambient factory conditions; material property specifications; machine operating parameters; and dimensional characteristics of the output exiting from each of the two stages. In addition to the dataset being multivariate and of medium dimensionality, it contained a lot of erroneous values. The objective was to denoise the dataset by replacing the erroneous values with values predicted using classification modelling. Noisy erroneous values in the batched dataset are first replaced with missing values. Then the missing values are imputed with predicted values using a built classifier model.
An end-to-end methodology for replacing erroneous values in time series datasets of multi-stage processes is proposed. The proposed methodology comprises data exploration, data pre-processing and preparation, feature reduction to boost the performance of the classification model that is built, results parsing with cleaning, and imputation of predicted values. Data exploration, preparation, parsing and imputation are undertaken using Python-based libraries and code. The machine learning tasks are performed using the WEKA open source software. Although the methodology is described for offline modelling, it culminates in a model that can be used for correcting erroneous streaming data. The built model can also be used for real-time predictive process monitoring. In other words, the methodology delivers multiple benefits besides correcting erroneous values in a dataset.
The remainder of this article is structured as follows: "Dataset de-noising and erroneous values correction" section discusses data de-noising and erroneous values correction in large datasets; "Materials and methods" section describes the materials and methods and presents the proposed methodology; "Results and analysis" section reports the application of the methodology with results and analysis; "Discussions and conclusions" section discusses the research implications with concluding remarks.

Dataset de-noising and erroneous values correction
Noise as it relates to data is the existence of irrelevant and erroneous values in a dataset.
Examples of irrelevant values are data points or attributes with constant (or near constant), non-changing values. An attribute with serialized values such as row ID numbers or time stamps, where these are not needed for a data mining task at hand, may also be considered as noise. Errors in a dataset can occur in various forms such as extreme and unrealistic values, missing values and suspicious values such as negative values.
Classification of noise is generally case-specific [4], and data modellers need to establish what is noise and what is not noise in the dataset.
Real-life datasets are often noisy [5,6], to varying degrees such as low, moderate, high and extreme [7]. Table 1 can be used to explain these degrees in a given m x n dataset matrix, where m is the number of attributes or columns and n the number of instances or rows. In Table 1, ν denotes valid values while ε denotes erroneous values. Attributes A1 and A3 have less than 5% erroneous values and can be considered as having low noise. Attributes A2, A4, A5 and A7 may be termed as moderately noisy, A6 has high noise, while A8 is extremely noisy.
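The grading of attributes by their share of erroneous values can be sketched in a few lines of Pandas. This is only an illustration: the 5% bound for low noise comes from the text, while the other bounds, the function name `noise_level` and the toy columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def noise_level(error_fraction):
    """Map the fraction of erroneous values in an attribute to a label.

    Only the 5% bound for "low" is taken from the text; the remaining
    bounds are assumed purely for illustration.
    """
    if error_fraction < 0.05:
        return "low"
    if error_fraction < 0.30:  # assumed bound
        return "moderate"
    if error_fraction < 0.70:  # assumed bound
        return "high"
    return "extreme"


# Toy data: NaN stands for an erroneous value, 1.0 for a valid value.
toy = pd.DataFrame({
    "A1": [1.0] * 98 + [np.nan] * 2,
    "A2": [1.0] * 80 + [np.nan] * 20,
    "A6": [1.0] * 50 + [np.nan] * 50,
    "A8": [1.0] * 10 + [np.nan] * 90,
})

error_fraction = toy.isna().mean()        # share of erroneous values per attribute
levels = error_fraction.map(noise_level)  # one label per attribute
print(levels.to_dict())
```

In practice the bounds would be agreed with the process owners rather than fixed in code.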
Noise may be inevitable in some datasets. For example, sensors, which are the most ubiquitous data source in many data driven decisions, are known to generate erroneous data as a result of mechanical and electronic disturbances [7,8,9]. If the sensors in a system are very sensitive, the system may regularly generate noisy datasets where mechanical disturbances and electrical fluctuations are regular. Significant noise in a dataset can lead to erroneous statistical inferences which may misinform decision makers. Where noise ranges from moderate to extreme, it can negatively affect the speed and accuracy performance of machine learning algorithms [10]. Robust approaches are needed when handling noisy datasets.

Table 1 Example dataset showing pattern for erroneous values
Data cleaning is a time consuming and effort demanding task in a machine learning process. Data cleaning can be approached using integrity constraints methods [11], statistical tools [12] and machine learning models [13]. A number of options exist for addressing erroneous values in a multivariate dataset [14]. In one method, data instances with error values can be discarded, but this may significantly reduce the size of the dataset where the dataset is small to medium size and where the noisy instances are substantial. If columns or attributes with erroneous values are discarded (as in [15]), decision makers may miss some critical attribute relationships and the dynamics that take place in the system. Additionally, the expunged readings may hold valuable time-related information, and so cannot be excluded for a particular decision. Another method is to replace erroneous values with mean or mode values for the attribute that has the erroneous values, but this may lead to wrong statistical or learning inferences where a significant number of values are replaced. Yet another method is to use a non-noisy surrogate dataset [16] to replace the erroneous values. The surrogate dataset can come from the same system, such as from prior data collection. This may not be an option if the system is prone to generating noisy data (as is the case with sensors), or if the conditions are different. A fourth way of dealing with erroneous values is to mine them using a machine learning approach, either learning with noise [7] or replacing the erroneous values [13,17]. Replacing erroneous values enables a reconstructed dataset to evolve. A reconstructed dataset can be of value in other ways besides the prediction of some values.
In the literature, researchers' efforts have been mainly focused on developing and evaluating algorithms and methods that are robust to noise [6,7,18-20] or for filtering out noise [21,22]. There is a paucity of studies that detail the methodological end-to-end process for correcting erroneous values in a dataset. There is also a scarcity of studies that describe how process datasets can be de-noised through erroneous values correction. Meanwhile, the heterogeneity of datasets and the variety of machine learning goals make a generalized de-noising method infeasible. As such, studies and methods relating to data noise correction are needed to account for the uniqueness of datasets and machine learning goals.

Materials and methods
This research aims to develop a methodology that uses classification prediction modelling to replace erroneous sensor-generated values of multi-stage process datasets. This section details the tools for analysis.

Dataset of multi-stage, multi-output process
In early 2020, a dataset was posted by Liveline Technologies on Kaggle [3], which hosts open source dataset repositories for different types of datasets representing a variety of systems and problem types. The time series dataset was posted to elicit responses from data modellers on how accurately they can predict values into the future. The dataset was found to be useful in other ways. Firstly, the dataset comes from a real-life multi-stage continuous flow manufacturing process, so mining it would generate real-life insights. Secondly, the system that the dataset represents is typical of many process types, with activities in parallel and in series as well as assembly points. Additionally, it is a time series dataset which typifies the type of data that is streamed and batched in many systems. The results of a machine-learning based modelling approach could be generalizable for many types of systems and processes. The dataset contained a lot of erroneous values. It is possible that the original dataset was perturbed by the dataset providers, but erroneous data entries occur in real-life systems datasets, especially those from sensors. The occurrence may be consequential, whether frequent or infrequent. Documenting the process of correcting a very noisy dataset would be beneficial to decision makers in systems who are at any time faced with dealing with such datasets.
The dataset, the description of the data sampling rate and the background information about the system are available at Kaggle.com [3]. Based on the information disclosed by the dataset providers, the system is a two-stage continuous flow manufacturing process, which can be depicted as shown in Fig. 1 [23]. Outputs from each stage are measured in 15 locations each, and they are the primary measurements to correct. The raw dataset (in csv format) consisted of 116 variables (column labels, see Appendices 1, 2, 3, 4) and 14,088 observations (row data). The 14,088 observations are based on time stamps at 1-second intervals. As imposed by the dataset providers, the key variables of concern are the 15 stage 1 output measurements and 15 stage 2 output measurements (see Fig. 1). These measurements were given in mm, suggesting that they are dimensional features.
The dataset owners provided information that the dataset was very noisy (see Fig. 2 excerpts for time series plots of some of the variables). It was observed, from the raw dataset, that there is value variation in some key process output features, where the values are required to be fixed (see example charts in Fig. 2a, b). This is an indication that the values may be noisy. The dataset providers confirmed this. A lot of the values are also zero, suggesting they are inaccurate readings, since dimensional features must take on positive values. So, the main challenge was to de-noise and correct the dataset.

Feature selection modelling
A feature (or attribute) is an individual measurable property of the system that is being investigated [24]. In a dataset, it is referenced using the column labels. The focus of feature selection is to select a subset of variables from the input which can efficiently describe the input data while reducing effects from noise or irrelevant variables and still provide good prediction results. Feature selection is the systematic reduction of the feature space of a dataset. The key objective is to improve the accuracy and reduce the cost of the learning model [7,25]. In feature selection, only those features (variables or attributes) that show high sensitivity to the target output are retained [26]. Others are considered as noise or irrelevant to a given machine learning task [27]. Other benefits include reducing the dataset size for storage [28] and pinpointing key input variables that affect an output variable [29]. Feature selection or reduction allows the application of more complex algorithms that would otherwise be infeasible where the full dataset is used [30].

Fig. 1 Process flow representing the multi-stage continuous flow production system
Examples of feature selection algorithms that have been widely used within a manufacturing system context include Principal Component Analysis [31], ReliefF [32] and Correlation Based Filter Selection [33]. These feature selection algorithms have their specific uses and scope, benefits and limitations, and types of dataset they can handle. For instance, Principal Component Analysis is very useful for extracting relevant features from high dimensional data such as image data [34], while Correlation Based Filter Selection works well on numeric data [35]. ReliefF and tree-based algorithms are able to handle noise better than others [36].
In its simplest form, a feature selection workflow can be described as shown in Fig. 3. The workflow describes the use of a filter-based algorithm that uses a ranking scheme to rank features from most relevant to least relevant, in a supervised manner.
As shown in Fig. 3, the dataset containing all features is evaluated using candidate feature reduction algorithms. The resulting subset is then evaluated using candidate machine learning models [24,37] or one that has been chosen previously. For a ranking-based feature selection approach, different cut-off points for the ranked features are chosen and evaluated to identify the optimal cut-off point beyond which the improvement in learning performance becomes marginal. Selected features are not generalizable for the entire dataset. They are specific to the target or output attribute, and so in a multi-target attribute dataset, different feature sets need to be extracted for each target [38]. In other words, the process (Fig. 3) is repeated once for each target attribute that is to be predicted.
There is a trade-off between reduced feature space dimensionality and prediction accuracy [15]. Data modellers have to choose whether to proceed with the full set or a reduced set, depending on the prediction accuracy of using the reduced set in comparison to using the full set. In a very large dataset with thousands of features, it is good practice to exclude non-informative and irrelevant features first [39] as this can enhance the feature selection process. Other pre-processing tasks that enhance feature selection modelling include using clean instances of the dataset [40] and data type transformation, such as normalization, discretization and nominalization, as would normally be fulfilled prior to classification, clustering and association modelling, for example.
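As a small illustration of the data type transformations mentioned above, the sketch below applies min-max normalization and equal-width discretization to a numeric attribute using Pandas. The function names, the number of bins and the bin labels are assumptions made for the example; they are not taken from the study.

```python
import numpy as np
import pandas as pd


def min_max_normalize(series):
    """Rescale a numeric attribute to the range [0, 1]; NaN stays NaN."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0  # constant attribute: everything maps to 0
    return (series - series.min()) / span


def discretize(series, n_bins=3):
    """Turn a numeric attribute into nominal labels using equal-width bins."""
    labels = [f"bin{i}" for i in range(n_bins)]
    return pd.cut(series, bins=n_bins, labels=labels)


values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan])
normalized = min_max_normalize(values)
binned = discretize(values)
print(normalized.tolist())
print(binned.tolist())
```

Missing entries are left untouched by both transformations, so they remain available for later prediction.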
ReliefF ranks features from most sensitive to least sensitive to the class attribute. Data modellers can then decide which cut-off point to test on the classification model. For this reason, the ReliefF feature selection algorithm was chosen. Principal Component Analysis was ruled out as it is computationally costlier. Additionally, it uses orthogonal transformation to map features that may be correlated into a new feature space where there is minimum correlation between the new features, thereby creating new features from existing ones [15]. The new features diminish the physical meaning of the original features, and require additional modelling and computation to understand what the new features represent. Correlation Based Filter Selection was also considered, but this does not give a ranking of how important features are in explaining a class attribute.
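To make the ranking idea concrete, the following is a minimal NumPy sketch of the basic Relief scheme, using one nearest hit and one nearest miss per instance. It is a simplification for illustration and not the ReliefF implementation in WEKA that was used in this study; the function name `relief_scores` and the synthetic data are assumptions.

```python
import numpy as np


def relief_scores(X, y):
    """Simplified Relief: one nearest hit and one nearest miss per instance.

    Assumes numeric features and at least two instances in every class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    n, d = Xs.shape
    weights = np.zeros(d)
    for i in range(n):
        diff = np.abs(Xs - Xs[i])  # per-feature differences to instance i
        dist = diff.sum(axis=1)    # Manhattan distance to instance i
        dist[i] = np.inf           # never pick the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        # A relevant feature differs across classes but not within a class.
        weights += diff[miss] - diff[hit]
    return weights / n


# Synthetic example: column 1 determines the class, column 0 is pure noise.
rng = np.random.default_rng(0)
signal = rng.uniform(0, 1, 200)
noise = rng.uniform(0, 1, 200)
X = np.column_stack([noise, signal])
y = (signal > 0.5).astype(int)

scores = relief_scores(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, most relevant first
print(scores, ranking)
```

Taking the first k entries of `ranking` gives the top-k subset that is then tested on the classification model.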

Classification modelling
Classification prediction remains the most utilized machine learning technique [37], ahead of clustering and association rules, even in the context of manufacturing systems [41]. Classification prediction involves the process of using observations of a known group to estimate observations of an unknown group [8,42], on the basis of the least estimation error [43] and modelling cost [44]. Accuracy of prediction or estimation error is based on the measurement error (or distance) between the predicted and the observed actual value. Cost of modelling is a function of the total time to train and test the model.
Classification prediction, as with other machine learning methods, is fulfilled in two main stages [45]. In the first stage, the classifier model (algorithm and specified parameters) is built by training it to memorize observations and patterns of a portion (train set) of the known group. In the second stage, the trained model is validated on the other portion of the known group (test set) to estimate the accuracy rate when using the classifier model on an unknown group. The use of the classifier model on an unknown group is based on the known group and unknown group being homogeneous, in terms of having an identical number and type of attributes and representing the same system.
Neural Networks, Linear Regression, Support Vector Machines and Tree-based Classifiers, with their variants, have been the most prevalent learning algorithms in use in manufacturing [46]. These classifier learning algorithms handle data differently [46], deal with noise differently [47], and respond differently to data types, being more responsive to some data types than to others. For example, tree-based classifiers work well with nominal data types; Neural Networks and Linear Regression models are well suited for numeric data types; the J48 learner accepts only nominal-type data (such as discrete and categorical); and Random Forest is able to handle both numeric and nominal data types. Tree-based classifiers can handle small to medium sized datasets very well [48,49]. Tree-based classifiers such as Decision Tree and Random Forest have been the dominant classifiers in machine learning-based noise correction tasks. Some algorithms, such as Neural Networks, Logistic Regression and Support Vector Machine, perform badly when data is not normalized and where scale differences exist amongst features. The tree-based classifiers are rule-based, and so do not require pre-processing through data normalization. Having foreknowledge of this information helps data modellers narrow down the candidate classifier algorithms to test on a dataset, rather than testing the full spectrum.
In data modelling with classification learning, it is common to use a boosting method such as an ensemble of multiple classifiers [8] in order to improve prediction accuracy. The Random Forest classifier is an ensemble of tree-based classification models, and is known to be more accurate than J48 and Decision Tree [5]. Feature reduction has also been used to boost the performance of classifier models.
In choosing the modelling algorithm to use in building a classification prediction model, it is common to test different algorithms [7,8,15] and choose the one with the most promising performance (prediction accuracy and training/testing speed).

Software tools for building machine learning models
Identification and selection of software tools is rarely reported in detail [50]. Familiarity with the software, application domain of the software, versatility of the software in terms of data types and formats and learning algorithms handled, ease of communicating with other software and other factors are used to identify and select candidate machine learning software tools [50]. Software tools with a Graphical User Interface (GUI) such as KNIME and WEKA have been used extensively in data driven decisions relating to manufacturing systems [50]. These and other GUIs enable machine learning models to be built quickly with little or no programming code input. One major limitation of these GUIs is that programming code may be needed for data transmission between the software and the database that warehouses the dataset. Additionally, they individually have their specific limitations. It is therefore possible that one may need to rely on code written in C, Python, Java or R to automate some tasks such as data transmission and text extraction.
The use of a relational database is inevitable when learning on batched data, since batched data are by nature stored data. SQL-based and other relational databases are common in manufacturing [51]. These databases provide the necessary structure for storing and organizing data to facilitate quick and easy access and manipulation. Microsoft Excel has also proven useful [50]. Relational databases have the type of structure (columns and rows) that is responsive to queries and manipulation using programming code. In the current study, WEKA was used for the machine learning tasks. Python code was written for data pre-processing, data parsing and extraction of relevant information. Microsoft Excel was used as the database and for opening csv data file formats.

Methodology for correcting noise in multiple attributes in a dataset
The methodology that was developed is depicted in Fig. 4. The first three steps represent the data exploration and pre-processing of the data, including data transformation. The subsequent seven steps are the prediction learning steps. In the methodology, prediction learning is fulfilled one attribute at a time. In a situation where there are multiple attributes to be corrected, these steps are repeated for each attribute that is noisy. The final step involves replacing the missing values with the predicted values, which culminates in the corrected dataset.
In the developed methodology, classification was used. A clustering or association mining approach can also be used. In order to reduce the feature space of the dataset and boost the performance of the classification model, feature reduction modelling was applied. Classification modelling would therefore serve two purposes: (a) to validate the ability of a few selected features to represent the entire feature space and (b) to predict replacement values for erroneous values. For these reasons, the prediction learning aspect is mainly based on feature selection and classification prediction modelling methods.

Fig. 4 Flow process for correcting noise in multiple attributes in a dataset
The methodology that is depicted in Fig. 4 was developed while correcting the noisy dataset described in "Dataset de-noising and erroneous values correction" section. It was developed through an iterative process that tested other possible pipelines [50]. In the next section, the developed methodology is re-applied to the same noisy dataset to explicate it as a reusable approach.

Results and analysis
The dataset used in this study contained a significant amount of erroneous values in those attributes representing output variables. The objective was to denoise the dataset by replacing the erroneous values with values predicted using classification modelling. The results presented here are the steps and outcomes from adopting the methodology shown in Fig. 4.

Explore data and establish noise
Data exploration was fulfilled using Pandas, a data analysis library for the Python programming language. Time series plots of attributes were analysed for trends and anomalies. Many of the stage 1 and stage 2 output measurements exhibited trends similar to those shown in Fig. 2. In the dataset, noisy values were identified as values outside of a user-specified range, including zeros and outliers.
A correlation matrix was generated to determine if attributes were strongly, moderately or weakly correlated. Strong correlations are an indication that one in a pair of attributes may be redundant. It is also an indication that statistical tools can be used to explain pairwise relationships. For example, if stage 1 measurement output and stage 2 measurement output are highly correlated, then either can be used to explain the other, through simple statistical expressions. The correlation matrix, however, indicated mainly weak correlations (see the correlation coefficients shown in the matrix of Fig. 5).
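A correlation screen of this kind can be sketched with Pandas as follows. The threshold of 0.8 for a "strong" correlation, the helper name `strongly_correlated_pairs` and the toy columns are assumptions for illustration only; the study does not state the cut-off it used.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(df, threshold=0.8):
    """List attribute pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr()  # pairwise Pearson correlation matrix
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Toy data: "x_copy" is almost a rescaled copy of "x"; "unrelated" is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "x": base,
    "x_copy": 2.0 * base + 0.01 * rng.normal(size=500),
    "unrelated": rng.normal(size=500),
})

pairs = strongly_correlated_pairs(toy)
print(pairs)
```

An empty result, as was largely the case for the dataset in this study, means that no attribute can be dropped as redundant on correlation grounds alone.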

Convert noisy values to missing values
Sensor readings can be erroneous due to vibrations and interference on the sensing surface of the output feature. A user-specified range of true values was used to sift out erroneous sensor readings falling outside of the range. Values outside the range were converted to 0 values, while those values within the true range were retained. The zero values were then converted to missing (NaN) values. Focus was on those stage 1 and stage 2 output measurements that exhibited significant erroneous values. The conversions were performed using the Pandas and NumPy Python libraries (see Algorithm 1). Table 2 shows the attributes with missing values.

Remove non-relevant attributes
Some attributes do not add value to a learning task. When included in the dataset for learning, they hinder the model performance. For the learning task at hand, the time stamp attribute was not needed. In addition, the set point measurements in the dataset were also not needed, since they were constant, non-changing values. If the correlation analysis had established a strong correlation between any two attributes, one of them could have been discarded as being redundant.
Attributes with significant missing values can also be discarded. Attributes with more than 70% missing values fall into this category. The reasoning is that the non-missing values, which would have been used to predict the missing values, are insufficient to comprehensively describe the pattern for the entire set of values for the attribute. Three attributes in stage 1 output measurements (5, 6 and 11) and one in stage 2 output measurements (4) fell into this category.
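The two pre-processing steps above (converting out-of-range readings to missing values, then removing attributes that do not help learning) can be sketched with Pandas and NumPy as below. This is an illustrative sketch and not the study's Algorithm 1: it masks out-of-range values directly to NaN rather than passing through zero, and the column names, valid ranges and helper names are hypothetical. Only the 70% threshold is taken from the text.

```python
import numpy as np
import pandas as pd


def mask_out_of_range(df, valid_ranges):
    """Replace values outside [low, high] with NaN for the listed columns."""
    out = df.copy()
    for col, (low, high) in valid_ranges.items():
        out[col] = out[col].where(out[col].between(low, high), np.nan)
    return out


def drop_unhelpful_columns(df, drop_cols=(), max_missing=0.70):
    """Drop named columns, constant columns and columns that are mostly missing."""
    out = df.drop(columns=list(drop_cols))
    constant = [c for c in out.columns if out[c].nunique(dropna=True) <= 1]
    out = out.drop(columns=constant)
    too_sparse = [c for c in out.columns if out[c].isna().mean() > max_missing]
    return out.drop(columns=too_sparse)


# Hypothetical excerpt: zeros and 99.0 stand for erroneous sensor readings.
raw = pd.DataFrame({
    "time_stamp": range(10),
    "setpoint": [5.0] * 10,
    "S1M1": [12.1, 0.0, 12.3, 12.2, 99.0, 12.0, 12.4, 0.0, 12.2, 12.1],
    "S1M5": [0.0, 0.0, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0],
})
valid_ranges = {"S1M1": (11.0, 13.0), "S1M5": (2.0, 4.0)}  # assumed ranges

masked = mask_out_of_range(raw, valid_ranges)
cleaned = drop_unhelpful_columns(masked, drop_cols=["time_stamp"])
print(list(cleaned.columns))
```

In this toy excerpt the time stamp, the constant set point and the attribute with 80% missing values are removed, leaving only the attribute that is worth correcting.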

Set class attribute
A class attribute can be described as the particular attribute whose values are to be predicted, also known as the target, dependent or output variable. In the dataset being worked on, there are multiple class attributes since there are multiple attributes whose values are noisy, something that is common in real-life datasets. This is a classic multi-target classification problem, where many attributes are considered as the target. There are two known approaches for treating this type of multi-target classification problem [52]. One is to build a separate model for each target. The second approach is to use a multi-target classification algorithm [53], such as Random Linear Target Combination and Multi-Objective Random Forest [54]. Using single-target classifiers for a dataset that has many class attributes would be an arduous and time-consuming task, if one is to build a separate model for each class attribute. Despite this, they perform just as well as [52], and sometimes even better than [54], multi-target classifiers. Moreover, the single-target classification approach is more prevalent and more straightforward [52]. For the small number of attributes that are to be corrected in the current dataset, a single-target classification approach is sufficient.
In this research, the prediction learning is on the basis of a single-target classification approach, hence steps four to ten (see Fig. 4) are repeated for each class attribute that is to be corrected for noise. The approach is a systematic one where attributes with erroneous values are corrected, one attribute at a time [55].

Exclude non-relevant attributes
The dataset being corrected represents a multi-stage manufacturing process. Specifically, the system is a two-stage continuous flow manufacturing process. It is better to split the class-specified dataset into relevant and not-relevant sets to improve learning performance [56]. With reference to Fig. 1, relevant to predicting stage 1 output measurements are those variables upstream of stage 1 output measures. Stage 2 processes and stage 2 output measures are not relevant, since they are downstream of stage 1 output measurements. Similarly, stage 1 output measurements and stage 2 processes are relevant for predicting stage 2 output measurements. Including stage 1 process parameters in the set for predicting stage 2 output measurements will be a redundancy, because stage 1 output measures are the results of stage 1 process parameters.
Bearing this in mind, a dataset is carved out of the class-specified dataset. The generated dataset includes the class attribute as well as those attributes that represent process parameters upstream of the attribute. As an example, if stage 1 output measurement 2 (S1M2) has been chosen as the class attribute, then the dataset is pruned to exclude all other stage 1 output measurements, as well as the stage 2 process attributes and stage 2 output measurements. On the other hand, if stage 2 output measurement 7 (S2M7) is selected as the class attribute, the dataset includes the stage 1 output measurements and stage 2 attributes, but excludes other stage 2 output measurement attributes. The stage 1 output measurements should have been corrected prior to this, as noisy attributes can cause prediction errors. It is possible to use only the stage 2 process attributes (representing machines 4 and 5) to predict the stage 2 output measurement values, but by including stage 1 output measurement attributes, it is assumed that stage 2 output is formed from stage 1 output. In other words, stage 1 output measurements can be used to explain stage 2 output measurements. If a stage 3 process existed, the same analogy would be applied, and so on in subsequent stages. To eliminate confusion, it is best that the dataset attributes are ordered in such a way that the ordering represents the actual flow process.
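The stage-based pruning rule described above can be expressed as a small selection function. The sketch below is an illustration under assumed names: the grouping keys and the column labels are hypothetical and do not reproduce the actual labels of the dataset.

```python
def upstream_columns(class_attr, groups):
    """Select the predictors upstream of a class attribute, plus the class itself.

    `groups` is a hypothetical mapping with the keys 'stage1_inputs',
    'stage1_output', 'stage2_process' and 'stage2_output'.
    """
    if class_attr in groups["stage1_output"]:
        # Stage 1 outputs are explained by everything feeding stage 1.
        predictors = list(groups["stage1_inputs"])
    elif class_attr in groups["stage2_output"]:
        # Stage 2 outputs are explained by stage 1 outputs and stage 2 processes;
        # stage 1 inputs are left out as redundant.
        predictors = list(groups["stage1_output"]) + list(groups["stage2_process"])
    else:
        raise ValueError(f"{class_attr!r} is not an output measurement")
    return predictors + [class_attr]


# Hypothetical column labels, shortened for the example.
groups = {
    "stage1_inputs": ["ambient_temp", "machine1_speed", "machine2_temp"],
    "stage1_output": ["S1M1", "S1M2"],
    "stage2_process": ["machine4_temp", "machine5_pressure"],
    "stage2_output": ["S2M1", "S2M7"],
}

print(upstream_columns("S1M2", groups))
print(upstream_columns("S2M7", groups))
```

A third stage would add one more branch that follows the same pattern, with stage 2 outputs and stage 3 process attributes as predictors.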

Partition non-missing dataset
Although tree-based algorithms such as Decision Tree and Random Forest have been known to learn noisy datasets very well [47], performance is often degraded where noise exists in a dataset. As a result, the methodology proposes using a clean dataset (one without missing values) to build and test the prediction model. In order to do this, a dataset that has non-erroneous (non-missing) values for the class attribute is carved out of the class-specified dataset (which has both missing and non-missing values). This dataset is progressed for building the classifier model for predicting the class attribute. The built model would then be applied to the class-specified dataset to predict values for the missing values. The data partitioning was accomplished in the WEKA Explorer environment.
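The partition, build and fill sequence can be sketched in Python as shown below. This is not the WEKA workflow used in the study: it swaps in scikit-learn's `RandomForestClassifier`, and it assumes that the class attribute has already been nominalized and that the predictor columns contain no missing values. The function name and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def fill_missing_class(df, class_attr, random_state=0):
    """Train on rows where the class is known, then predict the missing entries."""
    known = df[df[class_attr].notna()]    # clean partition used for model building
    unknown = df[df[class_attr].isna()]   # rows whose class value is to be predicted
    out = df.copy()
    if unknown.empty:
        return out
    features = [c for c in df.columns if c != class_attr]
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(known[features], known[class_attr])
    out.loc[unknown.index, class_attr] = model.predict(unknown[features])
    return out


# Toy data: the nominal class depends on "param"; every tenth value is missing.
rng = np.random.default_rng(0)
param = rng.uniform(0, 1, 200)
label = np.where(param > 0.5, "high", "low").astype(object)
label[::10] = None
toy = pd.DataFrame({
    "param": param,
    "other": rng.uniform(0, 1, 200),
    "S1M1_bin": label,
})

filled = fill_missing_class(toy, "S1M1_bin")
print(filled["S1M1_bin"].isna().sum())
```

Rows whose class value was already known are left unchanged; only the missing entries receive predicted values.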

Apply feature reduction modelling
In the methodology, the main function of feature reduction is to boost the performance of the classifier model.To confirm that the few selected features can adequately represent the entire feature space, a classifier model is applied to the feature-reduced dataset (as described in Fig. 3).The performance is compared with learning using the full feature space.If prediction accuracy is not degraded significantly but modelling cost is improved, then the selected features are capable of representing the entire feature space, otherwise, the full dataset is used.
For the dataset being corrected, a feature reduction model was built on the basis of the ReliefF feature selection algorithm [57,58]. In the current research, subsets of the dataset made up of the top ranked 5, 10, 15 and 20 ReliefF selected attributes were assessed using models built with Logistic Regression (LR), Random Forest (RF), and Neural Networks (NN). The results (see Fig. 6) indicate the optimal number to be the first 10 ReliefF selected attributes, beyond which there is marginal or no improvement in prediction accuracy.
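Forming the candidate subsets from a ReliefF ranking can be sketched as follows. The sketch does not implement ReliefF itself: it assumes the relevance scores have already been computed, and the scores and attribute names shown are made up for illustration.

```python
def top_k_subsets(scores, sizes=(5, 10, 15, 20)):
    """Map each subset size to the names of the top-ranked attributes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: ranked[:k] for k in sizes}


# Hypothetical relevance scores for 25 attributes (higher is more relevant).
scores = {f"attr{i}": 1.0 / (i + 1) for i in range(25)}
subsets = top_k_subsets(scores)
print(subsets[5])
```

Each subset, together with the class attribute, would then be passed to the candidate learners for comparison.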
Table 3 compares the prediction accuracy of a Random Forest classifier model using: (a) the full dataset; (b) the top 10 ReliefF selected attributes with both missing and non-missing values for the class attribute; and (c) the top 10 ReliefF selected attributes with the non-missing set only. From the results presented in Table 3, it can be concluded that prediction accuracy improves when the ReliefF reduced subset is used instead of the full dataset. There is only a very marginal improvement of 2.2% from using the non-missing (clean) subset instead of the subset that contains both missing and non-missing values for the class attribute. For the current dataset problem, there is therefore no need to set apart a clean subset for building the classifier model, as this would add to the computational steps and time. The partitioning step 6 in the methodology (see Fig. 4) is therefore bypassed for the current dataset, but would be needed for a dataset where missing values significantly degrade the performance of the built classifier model.
The results indicate that the top 10 ReliefF selected attributes are able to represent the entire feature set for predicting S1M1, and are also able to boost the performance of the classifier model. The prediction results for the three candidate classifiers (see Fig. 4) suggest that a Random Forest-based classifier model is the best learner for the dataset. On this basis, the classification prediction proceeds with a hybrid modelling approach, namely ReliefF to reduce the feature space and Random Forest for value prediction.

Build classifier model
A Random Forest classifier model was used to evaluate the feature selection subset. The Random Forest algorithm is a combination of tree-based classifiers, such that after a large number of trees is generated, they vote for the most popular class [48]. Random Forest is by nature an ensemble, i.e. it consists of multiple classifiers (tree models), and its accuracy has been shown to be quite high for small to medium sized datasets [59].
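The voting step can be illustrated in a few lines. This sketch covers only the aggregation of tree outputs, not the growing of the trees. The function names are hypothetical, and the averaging variant reflects the usual treatment of a numeric target in Random Forest rather than anything stated in this article.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most popular class label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_vote(tree_predictions):
    """For a numeric target, the trees' outputs are averaged instead."""
    return sum(tree_predictions) / len(tree_predictions)


print(majority_vote(["pass", "fail", "pass", "pass", "fail"]))
print(average_vote([12.0, 12.5, 13.0, 12.5]))
```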
One drawback of Random Forest is reduced speed as a result of combining multiple classifiers. In the current research the dataset is considered small to medium sized, and so no speed deficiency was noticed. Moreover, the feature space has been reduced through feature reduction, which is expected to improve the speed of the classifier model. Another drawback of Random Forest, as with other tree-based algorithms, is that it cannot provide good estimates outside the boundaries of the training dataset; in other words, it extrapolates poorly, unlike regression algorithms [59].
The current prediction task does not warrant extrapolating the predictions outside the boundaries of the current dataset, so a tree-based classifier is sufficient for the present purpose. The dataset used for training is the top 10 ReliefF reduced dataset, which included missing and non-missing values for the class attribute: an 11-attribute × 14,074-instance dataset. The dataset instances were partitioned 80:20 into train and test sets, using random sampling. Mean absolute error was used as the evaluation metric. It is a prediction accuracy metric for classification models that learn numeric data types. It indicates the average magnitude of the error between the predicted values and the actual values. It is given by Eq. 1, where $e_i$ is the prediction error of the $i$th sample, and $n$ is the number of samples.
Values of mean absolute error closer to 0 show good prediction capability of the model. The mean absolute error for the Random Forest classifier model for predicting Stage 1 output measurement 1 (S1M1) values, using the top 10 ReliefF selected attributes, was 0.0363 (see Table 3). The model is saved to be applied for missing value imputation.
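The evaluation described above can be sketched in plain Python. This is an illustration rather than the WEKA procedure used in the study. The function names, the fixed `seed` and the sample values are assumptions, and the error is taken as the difference between actual and predicted values.

```python
import random


def mean_absolute_error(actual, predicted):
    """Average of |e_i|, where e_i is the error of the i-th sample."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition rows into train and test sets (80:20 by default)."""
    shuffled = rows[:]  # copy, so the caller's ordering is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative values only.
actual = [12.40, 12.50, 12.60]
predicted = [12.42, 12.47, 12.60]
mae = mean_absolute_error(actual, predicted)
train, test = train_test_split(list(range(10)))
print(round(mae, 4), len(train), len(test))
```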

Apply built classifier model on class-specified dataset
The saved model is re-applied to the same dataset, but this time without partitioning the dataset instances. The reason for this is so that the dataset instances are not disordered, as is the case with random sampling and partitioning. Re-ordering a disordered dataset after generating prediction results would add to the computational complexity of the machine learning process.
An excerpt of the prediction results (instances 1 to 4 and 9671 to 9689) is shown in Fig. 7. Under the column with actual values, '?' denotes a missing value. These results are saved to a text file to enable extraction of relevant information.
Figure 8a is a plot of the prediction error for predicting S1M1. The points lying along the 0 axis are mainly instances whose error values are missing because the actual values are missing. An analysis of the prediction error for the actual values showed that 99% of values lie below 0.05 mm and 56% below 0.01 mm (see Table 4). The results indicate that if 100 samples are taken, 99 can be predicted to an accuracy that is within 0.05 mm, using the model. Figure 8a reveals that some prediction errors are significant (values above 0.05), and they make up about 1% of samples. From the plot, most errors of this type occur where there is a significant number of missing values. These results can be taken to imply that prediction accuracy drops significantly in the vicinity of erroneous or missing values. Prediction accuracy to ± 0.6 mm may be good or bad depending on the situational requirements.
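The threshold analysis summarised in Table 4 amounts to counting the share of absolute errors that fall below a limit, skipping instances with no actual value. A minimal sketch follows. The function name and the error values are hypothetical and serve only to show the calculation.

```python
def share_within(errors, threshold):
    """Fraction of known absolute errors that fall below `threshold`."""
    known = [abs(e) for e in errors if e is not None]
    return sum(e < threshold for e in known) / len(known)


# Illustrative errors in mm; None marks an instance with a missing actual value.
errors = [0.004, -0.02, None, 0.008, 0.07]
print(share_within(errors, 0.05))
print(share_within(errors, 0.01))
```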

Extract predicted values
Python code was used to extract and organize the relevant information from the WEKA-generated result. The logic in the code creates a csv file for the results. Where the actual column value is '?', the predicted value is selected; otherwise the actual value is selected. By so doing, the column in the generated csv data contains the actual values and the predicted values for the class attribute. The code is shown in Algorithm 2. The code truncates information (by extracting only relevant column values) and parses information (from text to csv). The csv data is saved as a 1-column dataset.
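The selection logic can be sketched as follows. This is not the author's Algorithm 2. It assumes the WEKA result is a whitespace-separated text table with columns for instance number, actual, predicted and error, and the function names and sample rows are hypothetical.

```python
import csv
import io


def extract_corrected(weka_text):
    """Return one value per instance: the actual value, or the predicted
    value where the actual value is '?'."""
    corrected = []
    for line in weka_text.splitlines():
        fields = line.split()
        # Keep only data rows, which start with an integer instance number.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        actual, predicted = fields[1], fields[2]
        corrected.append(float(predicted if actual == "?" else actual))
    return corrected


def to_csv(values, header="S1M1"):
    """Serialise the corrected column as single-column csv text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    writer.writerows([v] for v in values)
    return buffer.getvalue()


# Assumed layout of the saved prediction output; values are illustrative.
sample = """\
 inst#     actual  predicted      error
     1      12.41      12.43       0.02
     2          ?      12.38          ?
     3      12.5       12.49      -0.01
"""
corrected = extract_corrected(sample)
print(to_csv(corrected))
```

Because rows are read in file order, the corrected column keeps the original instance ordering.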

Replace missing with predicted values
As the dataset is medium-sized, Microsoft Excel was sufficient for replacing the column (S1M1) of missing values with the column of non-missing (corrected) values. Figure 9 shows the dataset for attribute S1M1 before and after correcting for erroneous values. The plot shows that the corrected dataset has values that are consistent with true sensor readings for the system, compared with the initial dataset. This indicates the viability of the method, and on this basis steps four to ten are repeated, applying the combination of ReliefF and Random Forest, until the missing values in the selected stage 1 output measurements have all been replaced with their predicted values. The corrected dataset is then used for correcting the stage 2 output measurement values, with steps four to ten applied accordingly.
It is important to note that missing values of less than 5% are considered trivial [22,60]. They are too few to cause any major performance degradation or to cast suspicion on the results of applying a machine learning algorithm. Attributes with less than 5% missing values were therefore not corrected in this research.
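The 5% rule can be applied programmatically when choosing which attributes to correct. The sketch below assumes columns are held as lists and that a missing value is marked with '?'. The function name and sample columns are hypothetical. The separate exclusion of attributes with very many missing values is not modelled here.

```python
def attributes_to_correct(columns, threshold=0.05, missing="?"):
    """Return names of attributes whose missing fraction is at least `threshold`."""
    selected = []
    for name, values in columns.items():
        fraction = sum(v == missing for v in values) / len(values)
        if fraction >= threshold:
            selected.append(name)
    return selected


# Illustrative columns: 30% missing in S1M1, 2.5% missing in S1M2.
columns = {
    "S1M1": ["?"] * 3 + [12.4] * 7,
    "S1M2": ["?"] * 1 + [12.4] * 39,
}
print(attributes_to_correct(columns))
```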
Table 5 summarises the total number of erroneous values in the dataset before and after correction. The number of erroneous values is equal to the number of missing values, given that erroneous values were converted to missing values. From Table 5, if those attributes with significant missing values (see Table 2) are excluded, the number of erroneous values in the dataset is reduced by about 95%. The percentage can be 100% if those attributes with less than 5% erroneous values are also corrected. For a process flow dataset, the results can be taken to mean that the methodology is able to correct erroneous values in an attribute having as much as 65% spurious values.

Discussions and conclusions
The methodology yields a number of benefits for industry practitioners. If every noisy instance in a dataset were to be discarded, decision makers may be left with a much-reduced dataset for decision making. With the methodology, dataset owners and users in manufacturing and similar systems do not need to discard very noisy attributes or noisy instances. The methodology helps in the correction of erroneous values in a dataset.
Data driven predictions along a multi-stage manufacturing process have been researched [61][62][63]. The studies do not explicate how to navigate the entire process of mining a multi-stage manufacturing process dataset. In other words, they do not describe in detail data exploration, pre-processing, dataset partitioning, model building, results parsing and cleaning, and analysis. The methodology and the study reported in this article bridge this knowledge gap, and so provide process managers with the workflow for mining and correcting their datasets.
A manufacturing process can be defined as a sequence of interlinked operational activities in a manufacturing system, where each activity leads to the next, and the flow of activities forms a whole [64]. Manufacturing is generally fulfilled in stages. It was shown in this article that each stage in a manufacturing process can be used to demarcate the relevant process variables that should be used to build a learning model. This helps in pruning the dataset so that model performance is enhanced. Data modellers for multistage process learning would find this very useful where there are thousands of process variables and tens to hundreds of stages.
Machine learning is an enabler for knowledge discovery. The methodology uses both feature engineering and classification modelling methods. Both methods on their own are able to extract complex patterns in large datasets that help provide a deeper understanding of the system that generated the dataset. The feature reduction component of the methodology was able to indicate those few attributes that can be used to explain each output measurement. This is a noise reduction strategy, as it reduces the feature space to only those attributes that are most relevant to explaining the target attribute. Decision makers can then use such information to understand the core relationships in their system. The classification prediction modelling process to replace missing values is similar to the process one would follow when building a classifier prediction model for predictive process monitoring. The prediction models built to replace missing values can also be used in subsequent prediction tasks. In other words, there is no need to build another model; rather, the task would be to improve training by increasing the number of training samples, which ultimately leads to an increased prediction accuracy of the model.
The dataset correction approach uses prediction to replace erroneous values in the dataset. With the approach, a predictive model is built. The built model has a very high predictive accuracy (see the "Apply built classifier model on class-specified dataset" section), giving one confidence that the replacement values accurately represent what would have been true values.
The purpose of denoising the dataset by replacing erroneous values with their likely values is so that the corrected dataset can be used for data modelling. With the corrected dataset, analysis of the system that generated the dataset can be enhanced. For instance, the important patterns in the corrected dataset can be learnt using any classification or clustering algorithm that fits the dataset. A research effort that has emanated from the current one is the development of a real-time predictive process control system that can be used to avoid unqualified outputs at the end of each stage in the manufacturing process. This is made possible by the corrected dataset, the learning of the dataset and the discovery of the dataset patterns, all of which are outcomes of the current de-noising study.
The methodology presented in this article represents noise correction of batched data. The built models can be injected into streaming data to enable real-time predictive process monitoring [65]. In addition, the process and model building steps can be collapsed into one algorithm to significantly reduce the time taken to manually correct the dataset. This advancement to the methodology is being undertaken by the author.

Researchers have developed various approaches to dealing with data noise (refs).
No one approach can be said to be superior to the others, and it would appear that approaches are case specific. The specificity is due to the characteristics of the dataset as well as the objective of the data correction task. For instance, in this study, the dataset represented a multi-stage, multi-feature process. The inclusion of data partitioning in the methodological steps arises due to the type of dataset. Feature relevance reduction was used to improve the performance of the classifier model that was built to predict replacement values for the erroneous values.
Due to the specificity of the approaches to deal with data noise, results can hardly be compared. Approaches may be compared, but only where similar systems or datasets have been modelled. The uniqueness of the approach presented in this study is in the use of the end-to-end data mining process to implement noise correction in a dataset. The data exploration and pre-processing with feature reduction and classification modelling are all important tasks in the de-noising and dataset correction workflow.
[10] list some likely drawbacks of replacing noisy values with values predicted using a classifier model. They argue that it carries a high computational cost. This is true, considering the number of models to build when there are many attributes to correct for erroneous values. A multi-target algorithm can be considered as a likely alternative worth testing and is an area for further research.
During the analysis, it was found that prediction accuracy drops in the vicinity of erroneous or missing values. This finding has implications for real-time predictive process monitoring. Managers should be aware that prediction may degrade where there are many erroneous values prior to the value being predicted. This is another area worthy of further research.
There is a paucity of the type of methodology presented in the current research. To the best of the author's knowledge, it has not been reported in the literature. The methodology is generalizable in manufacturing systems and can be applied to non-manufacturing related datasets. In this article, the methodology has been described with reproducibility in view. In addition, the software tools are generic and open source.
The process is a continuous flow process consisting of two stages (see Fig. 1).

Fig. 3 Workflow for a feature selection process

The methodology (see Fig. 4) is based on swapping noisy erroneous values with missing values. Machine learning algorithms treat missing values as values to be skipped during learning, whereas noisy values (which are erroneous) may be included in learning, unless otherwise specified. The methodology proposes to replace erroneous values with missing values and then impute the most probable values for the missing values. The most probable values are predicted using a classification prediction model. The classification prediction model is built and trained using the clean (not missing) instances of the dataset. The trained model is then used to predict values for the missing instances.
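The first step, converting erroneous values to missing values, can be sketched as below. The validity rule would come from the process owners. The rule used here (readings strictly between 0 and 100), the function name and the sample readings are hypothetical.

```python
def invalid_to_missing(values, is_valid, missing="?"):
    """Replace every value that fails `is_valid` with the missing marker."""
    return [v if is_valid(v) else missing for v in values]


# Illustrative readings; the zero stands for a failed sensor reading.
readings = [12.4, 0.0, 12.6, 250.0]
cleaned = invalid_to_missing(readings, lambda v: 0.0 < v < 100.0)
print(cleaned)
```

Passing the rule in as a function keeps the conversion independent of any particular attribute or sensor.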

Fig. 6 Prediction accuracy performance comparison of top ranked features for predicting stage 1 output measurement 1

$$\text{Mean absolute error} = \frac{\sum_{i=1}^{n} |e_i|}{n} \qquad (1)$$

Figure 8b shows the scatter plot for predicted vs actual values. Most data points lie along the line of best fit, indicating good prediction accuracy of the model.

Fig. 7 Excerpt of prediction results for stage 1 output measurement 1

Fig. 8 Plot for stage 1 output measurement: (a) prediction error for each instance (b) scatter diagram for predicted vs actual values

Fig. 9 Time series plot comparing initial dataset with corrected dataset

In the first stage, Machines 1, 2, and 3 operate in parallel and feed into a Combiner. The output from the Combiner is measured in 15 different locations (Stage1.Output.Measurement0.U.Actual to Stage1.Output.Measurement14.U.Actual) against their set points (Stage1.Output.Measurement0.U.Setpoint to Stage1.Output.Measurement14.U.Setpoint). The output from the Combiner flows into the second stage, where Machines 4 and 5 process in series. The output from Machine 5 is measured in the same 15 locations (Stage2.Output.Measurement0.U.Actual to Stage2.Output.Measurement14.U.Actual) as the output from the Combiner and compared with their setpoint measurements (Stage2.Output.Measurement0.U.Setpoint to Stage2.Output.Measurement14.U.Setpoint). The output measurements are collected via sensors mounted on the machines. Zero measurements are a result of sensor and/or system failures. Erroneous values are a result of unsteady product movement and product surface contamination.

Received: 30 July 2020 Accepted: 8 October 2020

Table 4 Prediction accuracy summary (columns: samples with prediction errors; percentage of values)
The saved column spans 14,074 instances, with missing values replaced with predicted values. The column represents the corrected (non-missing, non-erroneous) data for S1M1.