A Predictive Noise Correction Methodology for Manufacturing Systems Datasets

In manufacturing systems, datasets intended for data-driven decisions are largely generated from time-sequenced sensor readings. Industrial sensors are prone to transmitting inaccurate readings for various reasons, which results in noisy datasets. Noisy datasets inhibit machine learning and knowledge discovery. This article reports a methodology that was developed to rectify a very noisy dataset of a multi-stage continuous manufacturing process with multi-feature output. In the methodology, erroneous values are first replaced with missing values. Then, feature reduction modelling is used as a precursor to prediction learning to improve machine learning performance. Afterwards, a classifier model is built to predict replacement values for the missing values. Finally, predicted values are inputted to replace the missing values in the dataset. With many attributes having erroneous values, value replacement is done one attribute at a time. The flow direction and stages in the manufacturing process are used in partitioning the dataset to improve learning performance. With the methodology, important system relationships are identified and the dataset is learnt for predictive process monitoring. There is a paucity of this type of methodology for dealing with noisy datasets in manufacturing systems. In the future, the plan is to inject the prediction model into streaming data to enable real-time erroneous value correction and real-time predictive process monitoring at the same time.


Introduction
Data-driven decisions in manufacturing systems rely heavily on time-sequenced readings from multiple time-synchronized sensors. In real-life settings, sensors are perturbed by mechanical vibrations and electronic signals. They are also prone to generating erroneous values when sensing product feature characteristics in high-speed production lines, for example where the product frequently shifts from its desired position. In other processes, such as extrusion lines, the extruded product may be stained, for example with cooling liquids, and this can cause sensors to misread. Erroneous sensor values can lead to inaccurate decisions, for example, raising false alarms on defects [1]. Additionally, noisy datasets are difficult to interpret by those not familiar with the system. Where noisy values are significant in the dataset, they inhibit machine learning performance and can also lead to false predictions [2]. It may be infeasible or expensive to change sensors or the system, and so decision makers are faced with no other option than to rely on noisy datasets. This article reports a machine learning-based methodology that was developed to correct a very noisy time series dataset of a multi-stage continuous manufacturing process. A dataset that was generated using a Manufacturing Execution System was used for the research. The 116-feature, 14,088-entry dataset included time series information for: ambient factory conditions; material property specifications; machine operating parameters; and dimensional characteristics of the output exiting from each of the two stages. In addition to the dataset being multivariate and of medium dimensionality, it contained a lot of erroneous values. The challenge was to correct the noisy values. In the process, a methodology was developed to guide decision makers in manufacturing systems on how to overcome the challenge of working with very noisy datasets.
The article is structured as follows: Sect. 2 introduces the research challenge; Sect. 3 presents related works; Sect. 4 describes the materials and methods; Sect. 5 presents the developed methodology and explicates it with the dataset described previously; Sect. 6 discusses the implications of the research and concludes with future directions.

Research Challenge
In early 2020, a dataset was posted by Liveline Technologies on Kaggle [3], which hosts open-source dataset repositories for different types of datasets representing a variety of systems and problem types. The time series dataset was posted to elicit responses from data modellers on how accurately they can predict values into the future. The dataset was found to be expedient in other ways. Firstly, the dataset comes from a real-life multi-stage continuous flow manufacturing process, so mining it would generate real-life insights. Secondly, the system that the dataset represents is typical of many manufacturing system types, as it has machines in parallel and in series as well as assembly stations. Additionally, it is a time series dataset, which typifies the type of data that is streamed and batched in many manufacturing systems. The results of a machine learning-based modelling approach could therefore be generalizable to many types of manufacturing systems. The dataset contained a lot of erroneous values. It is possible that the original dataset was perturbed by the dataset providers, but erroneous data entries do occur in real-life manufacturing systems datasets. Such occurrences may be infrequent or frequent, and they can be consequential. Documenting the process of correcting a noisy dataset would be beneficial to decision makers in manufacturing systems who are at any time faced with dealing with such datasets.
The dataset, the description of the data sampling rate and the background information about the system are available at Kaggle.com. Based on the information disclosed by the dataset providers, the system is a two-stage continuous flow manufacturing process, which can be depicted as shown in Fig. 1 [4]. The raw dataset (in csv format) consisted of 116 variables (column labels) and 14,088 data observations (row numbers). As imposed by the dataset providers, the key variables of concern are the 15 stage 1 output measurements and 15 stage 2 output measurements (see Fig. 1). These measurements were given in mm, suggesting that they are dimensional features.
The dataset owners provided information that the dataset was very noisy (see Fig. 2 for excerpts of time series plots of some of the variables). It was observed, from the raw dataset, that there is value variation in some key process output features where the values are required to be fixed (see example charts in Figs. 2a and 2b). This is an indication that the values may be noisy, and the dataset providers confirmed this. A lot of the values are also zero, suggesting that they are inaccurate readings, since dimensional features must take on positive values. So, the main challenge was to de-noise and correct the dataset.

Related Works
Noise correction in datasets

Noise, as it relates to data, is the existence of irrelevant and erroneous values in a dataset. Examples of irrelevant values are data points or attributes with constant (or near-constant), non-changing values. An attribute with serialized values, such as row ID numbers or time stamps, where these are not needed for the data mining task at hand, may also be considered as noise. Errors in a dataset can occur in various forms, such as extreme and unrealistic values, missing values, and suspicious values such as negative values. Classification of noise is generally case-specific [5], and data modellers need to establish what is and what is not noise in the dataset.
Real-life datasets are often noisy [6], to varying degrees such as low, moderate, high and extreme [7]. There are as many as four commonly occurring types of patterns for erroneous values in large datasets [8]. In this research, the focus is not on any of the commonly occurring types of erroneous patterns, but on the percentage of erroneous values for an attribute in a given m x n dataset matrix, where m is the number of attributes or columns and n the number of instances or rows. Table 1 can be used to explain this. In Table 1, ν represents valid values while ε represents erroneous values. Attributes A1 and A3 have less than 5% erroneous values and can be considered as having low noise. Attributes A2, A4, A5 and A7 may be termed moderately noisy, A6 has high noise, while A8 is extremely noisy. Noise may be inevitable in some datasets; for example, sensors, which are the most ubiquitous data source in many data-driven decisions, are known to generate erroneous data as a result of mechanical and electronic disturbances [7,9,10]. If the sensors are very sensitive, then such systems may regularly generate noisy datasets where mechanical disturbances and electrical fluctuations are regular. Significant noise in a dataset can lead to erroneous statistical inferences, which may misinform decision makers. Where noise ranges from moderate to extreme, it can negatively affect the speed and accuracy performance of machine learning algorithms [11]. Robust approaches are needed when handling noisy datasets.
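The per-attribute grading described above can be sketched in pandas. In the sketch below, only the less-than-5% "low" band comes from the text; the other numeric cut-offs, and the column of readings, are illustrative assumptions.

```python
import pandas as pd

def noise_level(series, valid_min, valid_max):
    """Grade an attribute by its percentage of out-of-range values.
    Only the <5% 'low' band is from the text; other cut-offs are illustrative."""
    pct_erroneous = (~series.between(valid_min, valid_max)).mean() * 100
    if pct_erroneous < 5:
        return "low"
    if pct_erroneous < 30:
        return "moderate"
    if pct_erroneous < 70:
        return "high"
    return "extreme"

# Toy dimensional readings in mm, where zeros are invalid sensor outputs
readings = pd.Series([12.1, 12.3, 0.0, 12.2, 12.4, 0.0, 12.0, 12.3, 12.1, 12.2])
print(noise_level(readings, valid_min=1.0, valid_max=20.0))  # 20% erroneous
```

The same function applied column by column gives the kind of m x n noise profile that Table 1 illustrates.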
Data cleaning is a time-consuming and effort-demanding task in a machine learning process. Data cleaning can be approached using integrity constraints methods [12], statistical tools [13] and machine learning models [14]. A number of options exist for addressing erroneous values in a multivariate dataset [15]. In one method, data instances with erroneous values are discarded, but this may significantly reduce the size of the dataset where the dataset is small to medium sized and the noisy instances are substantial. If columns or attributes with erroneous values are discarded [as in 16], decision makers may miss some critical attribute relationships and the dynamics that take place in the system. Additionally, the expunged readings may hold valuable time-related information, and so cannot be excluded for a particular decision. Another method is to replace erroneous values with the mean or mode value of the attribute that has the erroneous values, but this may lead to wrong statistical or learning inferences where a significant number of values are replaced. Yet another method is to use a non-noisy surrogate dataset [17] to replace the erroneous values. The surrogate dataset can come from the same system, such as from prior data collection. This may not be an option if the system is prone to generating noisy data (as is the case with sensors), or if the conditions are different. A fourth way of dealing with erroneous values is to mine them using a machine learning approach, either learning with the noise [7] or replacing the erroneous values [14,18]. Replacing erroneous values enables a reconstructed dataset to evolve. A reconstructed dataset can be of value in other ways aside from the prediction of some values.
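To make the trade-offs concrete, the sketch below contrasts two of the options above on a toy pandas frame: discarding noisy instances (which shrinks the dataset) versus mean replacement (which can distort inferences when replacements are many). The column names are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp":  [20.1, 20.3, np.nan, 20.2, 20.4],   # NaN marks an erroneous reading
    "speed": [5.0, 5.1, 5.0, np.nan, 5.2],
})

# Option 1: discard instances that contain any erroneous value
dropped = df.dropna()                 # 5 rows shrink to 3

# Option 2: replace erroneous values with the attribute mean
mean_filled = df.fillna(df.mean())

print(len(dropped), round(mean_filled.loc[2, "temp"], 2))
```

The fourth, model-based option, which this article pursues, replaces each missing value with a learned prediction instead of a column statistic.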
This research attempts to correct a noisy dataset using a machine learning procedure. Machine learning can be used to provide an accurate replacement value for an erroneous value in a very large dataset. It also enables replacements to be made efficiently when they are many, say in the thousands. While decision makers in manufacturing systems are sometimes faced with very noisy datasets, there is a paucity of studies to guide them on how to navigate working with such datasets using a machine learning approach.

Materials And Methods
Feature selection modelling

A feature (or attribute) is an individual measurable property of the system that is being investigated [19].
In a dataset, it is referenced using the column labels. The focus of feature selection is to select a subset of variables from the input which can efficiently describe the input data while reducing effects from noise or irrelevant variables, and still provide good prediction results. Feature selection is the systematic reduction of the feature space of a dataset. The key objective is to improve the accuracy and cost of the learning model [7,20]. In feature selection, only those features (variables or attributes) that show high sensitivity to the target output are retained [21]. Others are considered as noise or irrelevant to a given machine learning task [22]. Other benefits include reducing the dataset size for storage [23] and pinpointing key input variables that affect an output variable [24]. Feature selection or reduction allows the application of more complex algorithms that would otherwise be infeasible if the full dataset were used [25].
Examples of feature selection algorithms that have been widely used within a manufacturing system context include Principal Component Analysis [26], ReliefF [27] and Correlation Based Filter Selection [28]. These feature selection algorithms have their specific uses and scope, benefits and limitations, and types of dataset they can handle. For instance, Principal Component Analysis is very useful for extracting relevant features from high-dimensional data such as image data [29], while Correlation Based Filter Selection works well on numeric data [30]. ReliefF and tree-based algorithms are able to handle noise better than others [31].
In its simplest form, a feature selection workflow can be described as shown in Fig. 3. The workflow describes the use of a filter-based algorithm that uses a ranking scheme to rank features from most relevant to least relevant, in a supervised manner.
As shown in Fig. 3, the dataset containing all features is evaluated using candidate feature reduction algorithms to produce a reduced subset. The subset is then evaluated using candidate ML models [19,32] or one that has been chosen previously. For a ranking-based feature selection approach, different cut-off points for the ranked features are chosen and evaluated to identify the optimal cut-off point beyond which the improvement in learning performance becomes marginal. Selected features are not generalizable for the entire dataset. They are specific to the target or output attribute, and so in a multi-target attribute dataset, different feature sets need to be extracted for each target [33]. In other words, the process (Fig. 3) is repeated as many times as there are target attributes to be predicted.
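The ranking-and-cut-off workflow of Fig. 3 can be sketched end to end with NumPy alone. ReliefF is not implemented here; a simple absolute-correlation ranking stands in for the ranking scheme, and a least-squares fit stands in for the candidate ML model. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 8))                           # 8 candidate features
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=n)  # only features 0 and 3 matter

# Rank features from most to least relevant (a filter-based ranking scheme)
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]

# Evaluate increasing cut-off points to find where improvement becomes marginal
for k in (1, 2, 4, 8):
    cols = ranking[:k]
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    mae = np.abs(y - X[:, cols] @ coef).mean()
    print(f"top {k}: MAE = {mae:.3f}")
```

In this synthetic case the error stops improving once the two informative features are included, which is exactly the "marginal improvement" criterion used to fix the cut-off point.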
There is a trade-off between reduced feature space dimensionality and prediction accuracy [16]. Data modellers have to choose whether to proceed with the full set or a reduced set, depending on the prediction accuracy of using the reduced set in comparison to using the full set. In a very large dataset with thousands of features, it is good practice to exclude non-informative and irrelevant features first [34], as this can enhance the feature selection process. Other pre-processing tasks that enhance feature selection modelling include using clean instances of the dataset [35] and data type transformation, such as normalization, discretization and nominalization, as would normally be fulfilled prior to classification, clustering and association modelling, for example.

Classification modelling
Classification prediction remains the most utilized machine learning technique [32], ahead of clustering and association rules, even in the context of manufacturing systems [36]. Classification prediction involves the process of using observations of a known group to estimate observations of an unknown group [9,37], on the basis of the least estimation error [38] and modelling cost [39]. Accuracy of prediction, or estimation error, is based on the measurement error (or distance) between the predicted and observed actual values. Cost of modelling is a function of the total time to train and test the model.
Classification prediction, as with other machine learning methods, is fulfilled in two main stages [40].
In the first stage, the classifier model (algorithm and specified parameters) is built by training it to memorize observations and patterns of a portion (train set) of the known group. In the second stage, the trained model is validated on the other portion of the known group (test set) to estimate the accuracy rate when using the classifier model on an unknown group. The use of the classifier model on an unknown group rests on the known group and unknown group being homogeneous, in terms of having an identical number and type of attributes, and representing the same system.
Neural Networks, Linear Regression, Support Vector Machines and Tree-based Classifiers, with their variants, have been the most prevalent learning algorithms in use in manufacturing [41]. These classifier learning algorithms handle data differently [41], deal with noise differently [42], and respond differently to data types. For instance, tree-based classifiers work well with nominal data types, and are more responsive to certain data types than to others. Neural Networks and Linear Regression models are well suited to numeric data types; the J48 learner accepts only nominal-type data (such as discrete and categorical); and Random Forest is able to handle both numeric and nominal data types. Tree-based classifiers can handle small to medium sized datasets very well [43,44]. Tree-based classifiers such as Decision Tree and Random Forest have been the dominant classifiers in machine learning-based noise correction tasks. Some algorithms, such as Neural Networks, Logistic Regression and Support Vector Machines, perform badly when data is not normalized and where scale differences exist amongst features. Tree-based classifiers are rule-based, and so do not require pre-processing through data normalization. Having foreknowledge of this information helps data modellers narrow down the candidate classifier algorithms to test on a dataset, rather than testing the full spectrum.
In data modelling with classification learning, it is common to use a boosting method such as an ensemble of multiple classifiers [9] in order to improve prediction accuracy. The Random Forest classifier is an ensemble of tree-based classification models, and is known to be more accurate than J48 and Decision Tree [6]. Feature reduction has also been used to boost the performance of classifier models.
Software tools for building machine learning models

Identification and selection of software tools is rarely reported in detail [45]. Familiarity with the software, application domain of the software, versatility of the software in terms of data types, formats and learning algorithms handled, ease of communicating with other software, and other factors are used to identify and select candidate machine learning software tools [45]. Software tools with a Graphical User Interface (GUI), such as KNIME and WEKA, have been used extensively in data-driven decisions relating to manufacturing systems [45]. These and other GUIs enable machine learning models to be built quickly with little or no programming code input. One major limitation of these GUIs is that programming code may be needed for data transmission between the software and the database that warehouses the dataset. Additionally, they individually have their specific limitations. It is therefore possible that one may need to rely on code written in C, Python, Java or the R language to automate some tasks, such as data transmission and text extraction.
The use of a relational database is inevitable when learning on batched data, since batched data are by nature stored data. SQL-based relational databases and NoSQL databases are common in manufacturing (Ismail et al., 2019). These databases provide the necessary structure for storing and organizing data to facilitate quick and easy access and manipulation. Microsoft Excel has also proven useful [45]. Relational databases have the type of structure (columns and rows) that is responsive to queries and manipulation using programming code. In the current study, WEKA was used for the machine learning tasks. Python was used to write code for data pre-processing, data parsing and extraction of relevant information. Microsoft Excel was used as the database and for opening csv data file formats.
Methodology for correcting noise in multiple attributes in a dataset
The methodology that was developed is depicted in Fig. 4. The first three steps represent the data exploration and pre-processing of the data, including data transformation. The subsequent seven steps are the prediction learning steps. In the methodology, prediction learning is fulfilled one attribute at a time. In a situation where there are multiple attributes to be corrected, these steps are repeated for each attribute that is noisy. The final step involves replacing the missing values with the predicted values, which culminates in the corrected dataset.
In the developed methodology, classification was used; a clustering or association mining approach could also be used. In order to reduce the feature space of the dataset and boost the performance of the classification model, feature reduction modelling was applied. Classification modelling therefore serves two purposes: a) to validate the ability of a few selected features to represent the entire feature space; and b) to predict replacement values for erroneous values. For the above-stated reasons, the prediction learning aspect is largely based on feature selection and classification prediction modelling methods.
The methodology that is depicted in Fig. 4 was developed while correcting the noisy dataset described in Sect. 2. It was developed through an iterative process that tested other possible pipelines [45]. In the next section, the developed methodology is re-applied to the same noisy dataset to explicate it as a reusable approach.

Results
Explore data and establish noise

Data exploration was fulfilled using Pandas, a Python data analysis library. Time series plots of attributes were analysed for trends and anomalies. Many of the stage 1 and stage 2 output measurements exhibited trends similar to those shown in Fig. 2. In the dataset, noisy values were identified as values outside of a user-specified range, including zeros and outliers.
A correlation matrix was generated to determine whether attributes were strongly, moderately or weakly correlated. Strong correlations are an indication that one in a pair of attributes may be redundant. They are also an indication that statistical tools can be used to explain pairwise relationships. For example, if a stage 1 output measurement and a stage 2 output measurement are highly correlated, then either can be used to explain the other through simple statistical expressions. The correlation matrix, however, indicated mainly weak correlations; see the correlation coefficients shown in the matrix of Fig. 5.
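A pairwise correlation matrix of this kind is a one-liner in pandas. The sketch below fabricates two unrelated measurement columns and one strongly related pair to show how a redundant attribute would surface; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "S1M1": rng.normal(12.0, 0.1, 100),
    "S2M1": rng.normal(12.5, 0.1, 100),   # independent of S1M1
})
df["S2M2"] = 1.02 * df["S1M1"] + rng.normal(0, 0.01, 100)  # near-duplicate of S1M1

corr = df.corr()
print(corr.round(2))
```

A coefficient near 1 (as for S1M1 vs S2M2 here) flags a candidate for redundancy; coefficients near 0 correspond to the mainly weak correlations actually observed in Fig. 5.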

Convert noisy values to missing values
Converting noisy values to missing values was undertaken using the Pandas and NumPy Python libraries, which were also used to explore the raw dataset. Focus was on those stage 1 and stage 2 output measurements that exhibited significant noisy values. A user-specified range of correct values was used to sift out erroneous values falling outside of the range. Table 2 shows the attributes with missing values.

Remove non-relevant attributes

Some attributes do not add value to a learning task. When included in the dataset for learning, they hinder model performance. For the learning task at hand, the time stamp attribute was not needed. In addition, the set point measurements in the dataset were also not needed, since they were constant, non-changing values. If the correlation analysis had established correlation between any two attributes, one could have been discarded as being redundant.
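The range-based sifting used under "Convert noisy values to missing values" amounts to masking out-of-range readings with NaN. A minimal pandas sketch, with an assumed plausible range in mm:

```python
import numpy as np
import pandas as pd

# Toy stage-output column in mm; the zeros and the 250.0 spike are noise
df = pd.DataFrame({"S1M1": [12.1, 0.0, 12.3, 250.0, 12.2, 0.0]})

lo, hi = 1.0, 50.0   # user-specified range of correct values (assumed)
df["S1M1"] = df["S1M1"].where(df["S1M1"].between(lo, hi), np.nan)

print(df["S1M1"].isna().sum())  # 3 values converted to missing
```

The resulting NaN entries are what the later classification step predicts replacements for.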
Attributes with significant missing values can also be discarded. Attributes with more than 70% missing values fall into this category. The reasoning is that the non-missing values, which would be used to predict the missing values, are insufficient to comprehensively describe the dataset pattern for the entire set of values of the attribute. Three attributes in the stage 1 output measurements (5, 6 and 11) and one in the stage 2 output measurements (4) fell into this category.
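Applying the 70% threshold is a single vectorized test in pandas; the two columns below are fabricated so that one crosses the threshold.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "S1M5": [np.nan] * 8 + [12.1, 12.2],   # 80% missing -> discard
    "S1M7": [12.0] * 9 + [np.nan],         # 10% missing -> keep
})

# Retain only attributes with at most 70% missing values
keep = df.columns[df.isna().mean() <= 0.70]
df = df[keep]
print(list(df.columns))
```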

Set class attribute
A class attribute can be described as the particular attribute whose values are to be predicted, also known as the target, dependent or output variable. In the dataset being worked on, there are multiple class attributes, since there are multiple attributes whose values are noisy, something that is common in real-life datasets. This is a classic multi-target classification problem, where many attributes are considered as targets. There are two known approaches for treating this type of multi-target classification problem (Last et al., 2010). One is to build a separate model for each target. The second approach is to use a multi-target classification algorithm [46], such as Random Linear Target Combination or Multi-Objective Random Forest [47]. Using single-target classifiers for a dataset that has many class attributes would be an arduous and time-consuming task, if one is to build a separate model for each class attribute. Despite this, they perform just as well as [48], and sometimes better than [47], multi-target classifiers. Moreover, the single-target classification approach is more prevalent and more straightforward [48]. For the small number of attributes that are to be corrected in the current dataset, a single-target classification approach is sufficient.
In this research, the prediction learning is on the basis of a single-target classification approach; hence steps four to ten (see Fig. 4) are done repeatedly for each class attribute that is to be corrected for noise. The approach is a systematic one where attributes with erroneous values are corrected one attribute at a time [49].

Exclude non-relevant attributes

The dataset being corrected represents a multi-stage manufacturing process. Specifically, the system is a two-stage continuous flow manufacturing process. It is better to split the class-specified dataset into relevant and non-relevant sets to improve learning performance [50]. With reference to Fig. 1, relevant to predicting the stage 1 output measurements are those variables upstream of the stage 1 output measures. Stage 2 processes and stage 2 output measures are not relevant, since they are downstream of the stage 1 output measurements. Similarly, stage 1 output measurements and stage 2 processes are relevant for predicting stage 2 output measurements. Including stage 1 process parameters in the set for predicting stage 2 output measurements would be a redundancy, because the stage 1 output measures are the results of the stage 1 process parameters.
Bearing this in mind, a dataset is carved out of the class-specified dataset. The generated dataset includes the class attribute as well as those attributes that represent process parameters upstream of the attribute. As an example, if stage 1 output measurement 2 (S1M2) is chosen as the class attribute, then the dataset is pruned to exclude all other stage 1 output measurements as well as the stage 2 process attributes and stage 2 output measurements. On the other hand, if stage 2 output measurement 7 (S2M7) is selected as the class attribute, the dataset includes the stage 1 output measurements and the stage 2 process attributes, but excludes the other stage 2 output measurement attributes. The stage 1 output measurements should have been corrected prior to this, as noisy attributes can cause prediction errors. It is possible to use only the stage 2 process attributes (representing machines 4 and 5) to predict the stage 2 output measurement values, but by including the stage 1 output measurement attributes, it is assumed that the stage 2 output is formed from the stage 1 output. In other words, stage 1 output measurements can be used to explain stage 2 output measurements. If a stage 3 process existed, the same analogy would apply, and so on for subsequent stages. To eliminate confusion, it is best that the dataset attributes are ordered in such a way that the ordering represents the actual flow of the process.
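Once the columns are ordered to match the flow direction, carving out the upstream attributes for a chosen class attribute reduces to index arithmetic. The column names and helper below are hypothetical stand-ins for the real attribute names.

```python
# Hypothetical column ordering that mirrors the process flow direction
columns = (
    ["Ambient1", "Machine1_Temp", "Machine2_Temp", "Machine3_Temp"]  # stage 1 processes
    + [f"S1M{i}" for i in range(1, 4)]                               # stage 1 outputs
    + ["Machine4_Temp", "Machine5_Temp"]                             # stage 2 processes
    + [f"S2M{i}" for i in range(1, 4)]                               # stage 2 outputs
)

STAGE1_PROCESS = {"Ambient1", "Machine1_Temp", "Machine2_Temp", "Machine3_Temp"}

def upstream_columns(target):
    """Keep the target plus upstream attributes, dropping sibling output
    measurements and, for stage 2 targets, the redundant stage 1 processes."""
    idx = columns.index(target)
    stage = target[:2]                                   # "S1" or "S2"
    kept = [c for c in columns[:idx] if not c.startswith(stage)]
    if stage == "S2":
        kept = [c for c in kept if c not in STAGE1_PROCESS]
    return kept + [target]

print(upstream_columns("S1M2"))
print(upstream_columns("S2M2"))
```

For a stage 1 target this keeps the stage 1 process variables; for a stage 2 target it keeps the stage 1 outputs plus the stage 2 process variables, matching the relevance rules stated above.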

Partition non-missing dataset
Although tree-based algorithms such as Decision Tree and Random Forest have been known to learn noisy datasets very well [42], performance is often degraded where noise exists in a dataset. As a result, the methodology proposes using a clean (no missing values) dataset to build and test the prediction model. In order to do this, a dataset that has non-erroneous (non-missing) values for the class attribute is carved out of the class-specified dataset (which has both missing and non-missing values). This dataset is progressed for building the classifier model for predicting the class attribute. The built model is then applied to the class-specified dataset to predict values for the missing values. The data partitioning was accomplished in the WEKA Explorer environment.
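In pandas terms (the partitioning in the study was done in WEKA), the split separates rows where the class attribute is present from rows where it is missing; the column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "S1M1":          [12.1, np.nan, 12.3, np.nan, 12.2],  # class attribute
    "Machine1_Temp": [70.0, 71.2, 70.5, 70.8, 71.0],
})

clean = df[df["S1M1"].notna()]       # used to build and test the classifier
to_impute = df[df["S1M1"].isna()]    # rows whose class values the model will predict

print(len(clean), len(to_impute))
```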

Apply feature reduction modelling
In the methodology, the main function of feature reduction is to boost the performance of the classifier model. To confirm that the few selected features can adequately represent the entire feature space, a classifier model is applied to the feature-reduced dataset (as described in Fig. 3). The performance is compared with learning using the full feature space. If prediction accuracy is not degraded significantly but modelling cost is improved, then the selected features are capable of representing the entire feature space; otherwise, the full dataset is used.
For the dataset being corrected, a feature reduction model was built on the basis of the ReliefF feature selection algorithm [51,52]. ReliefF ranks features from most sensitive to least sensitive to the class attribute. Data modellers can then decide which cut-off points to test on the classification model. Principal Component Analysis was ruled out as it is computationally costly. Additionally, it uses orthogonal transformation to map features that may be correlated into a new feature space where there is minimum correlation between the new features, thereby creating new features from existing ones [16]. Candidate classifiers, including Neural Networks (NN), were then evaluated on the ReliefF-ranked subsets. The results (see Fig. 6) indicate the optimal number to be the first 10 ReliefF-selected attributes, beyond which there is marginal or no improvement in prediction accuracy. Table 3 compares the prediction accuracy of a Random Forest classifier model using: a) the full dataset; b) the top 10 ReliefF attributes (missing and non-missing); and c) the top 10 ReliefF attributes using the non-missing set only. From the results presented in Table 3, it can be concluded that there is an improvement in prediction accuracy when using the ReliefF-reduced subset instead of the full dataset. There is only a very marginal improvement of 2.2% from using the non-missing (clean) subset instead of the subset that contains both missing and non-missing values for the class attribute. For the current dataset problem, there is therefore no need to set apart a clean subset for building the classifier model, as this would add to the computational steps and time. The partitioning step 6 in the methodology (see Fig. 4) is therefore bypassed for the current dataset, but would be needed for a dataset where missing values significantly degrade the performance of the built classifier model.
The results indicate that the top 10 ReliefF-selected attributes are able to represent the entire feature set for predicting S1M1, and are also able to boost the performance of the classifier model. The prediction results for the three candidate classifiers (see Fig. 4) suggest that a Random Forest-based classifier model is the best learner for the dataset. On this basis, the classification prediction is progressed on a hybrid modelling approach, namely ReliefF to reduce the feature space and Random Forest for value prediction.

Build classifier model

A Random Forest classifier model was used to evaluate the feature selection subset. The Random Forest algorithm is a combination of tree-based classifiers, such that after a large number of trees is generated, they vote for the most popular class [43]. Random Forest is by nature an ensemble, i.e. it consists of multiple classifiers (tree models), and its accuracy has been shown to be quite high for small to medium sized datasets [53]. The drawback of Random Forest is a speed deficiency as a result of combining multiple classifiers. In the current research, the dataset is considered small to medium sized, and so the speed deficiency was not noticeable. Moreover, the feature space had been reduced through feature reduction, which is supposed to boost the speed performance of the classifier model. Another drawback of Random Forest, like other tree-based algorithms, is that it cannot provide good estimations outside the boundaries of the training dataset; in other words, it extrapolates poorly, unlike regression algorithms [53]. The current prediction task does not warrant extrapolating the prediction forecasts outside the boundaries of the current dataset, so a tree-based classifier is sufficient for the present purpose.
The dataset used for training is the top 10 ReliefF-reduced dataset, which included missing and non-missing values for the class attribute: an 11-attribute x 14,074-instance dataset. The dataset instances were partitioned 80:20 into train and test sets using random sampling. Mean absolute error was used as the evaluation metric. It is a prediction accuracy metric for classification models that learn numeric data types, and it is indicative of the average deviation between the predicted values and the actual values. It is given by Eq. 1, where e_i is the prediction error of the i-th sample and n is the number of samples:

MAE = (1/n) * Σ_{i=1}^{n} |e_i|   (Eq. 1)
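Eq. 1 translates directly into code; the numbers below are illustrative only.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    # Eq. 1: MAE = (1/n) * sum_i |e_i|, with e_i the prediction error
    # of the i-th sample.
    e = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.abs(e).mean())

mae = mean_absolute_error([0.50, 0.52, 0.49], [0.51, 0.52, 0.46])
```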
Values of mean absolute error closer to 0 show good prediction capability of the model. The mean absolute error of the Random Forest classifier model for predicting Stage 1 output measurement 1 (S1M1) values, using the top 10 ReliefF-selected attributes, was 0.0363 (see Table 3). The model is saved to be applied for missing value imputation.
Apply built classifier model on class-specified dataset

The saved model is re-applied to the same dataset, but this time without partitioning the dataset instances. This ensures that the dataset instances are not disordered, as happens with random sampling and partitioning. Re-ordering a disordered dataset after generating prediction results would add to the computational complexity of the machine learning process.
An excerpt of the prediction results (instances 1 to 4 and 9671 to 9689) is shown in Fig. 7. In the column with actual values, '?' denotes a missing value. These results are saved to a text file to enable extraction of the relevant information. Figure 8a is a plot of the prediction error for predicting S1M1. The points lying along the 0 axis are mostly entries whose error values are missing because the actual values are missing. An analysis of the prediction error for the actual values showed that 99% of errors lie below 0.05 mm and 56% below 0.01 mm (see Table 4). The results indicate that if 100 samples are taken, 99 can be predicted to an accuracy within 0.05 mm using the model. Figure 8b shows the scatter plot of predicted vs actual values. Most data points lie along the line of best fit, indicating good prediction accuracy of the model.

Extract predicted values

A Python-based code was used to extract and organize the relevant information from the WEKA-generated results. The logic in the code created a csv file for the results: where the actual column value is '?', the predicted value is selected; otherwise the actual value is selected. The resulting column in the generated csv data thus contains the actual values together with the predicted replacements for the class attribute. The code is shown in Table 5. The code truncates information (by extracting only the relevant column values) and parses information (from text to csv). Steps four to ten of the methodology are then applied accordingly to the remaining attributes to be corrected.
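The selection logic described above can be sketched as follows. The whitespace-separated column layout (inst#, actual, predicted, error) is an assumption about the WEKA output; the actual Table 5 script is not reproduced here.

```python
def merge_predictions(weka_lines):
    """Pick the actual value per instance, substituting the predicted value
    wherever the actual entry is the missing-value marker '?'.
    Assumes whitespace-separated columns: inst#, actual, predicted, error."""
    merged = []
    for line in weka_lines:
        parts = line.split()
        if len(parts) < 3 or not parts[0].isdigit():
            continue                          # skip headers and blank lines
        actual, predicted = parts[1], parts[2]
        merged.append(predicted if actual == "?" else actual)
    return merged

sample = [
    "inst#  actual  predicted  error",
    "1      0.512   0.509      -0.003",
    "2      ?       0.498",
]
values = merge_predictions(sample)
```

The merged column would then be written out with `csv.writer` to produce the corrected csv file.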
It is important to note that missing values of less than 5% are considered trivial [54,55]: they are too insignificant to cause any major performance degradation or to cast suspicion on the results of applying a machine learning algorithm. Attributes with less than 5% missing values were therefore not corrected in this research.
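The triviality screen can be expressed as a short NumPy check; the 5% threshold is the one cited above, while the data are illustrative.

```python
import numpy as np

def attributes_to_correct(X, threshold=0.05):
    # Fraction of missing (NaN) entries per column; only attributes above
    # the threshold are routed through prediction-based correction.
    missing_frac = np.isnan(X).mean(axis=0)
    return [j for j, frac in enumerate(missing_frac) if frac > threshold]

demo = np.zeros((100, 3))
demo[:10, 0] = np.nan                 # 10% missing -> to be corrected
demo[0, 1] = np.nan                   # 1% missing  -> left as-is
flagged = attributes_to_correct(demo)
```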

Discussions And Conclusions
The methodology yields a number of benefits for industry practitioners. If every noisy instance in a dataset were to be discarded, decision makers could be left with a much-reduced dataset for decision making.
With the methodology, dataset owners and users in manufacturing systems, and in similar systems, do not need to discard very noisy attributes or noisy instances. The methodology helps in the correction of erroneous values in a dataset.
Data-driven prediction along a multi-stage manufacturing process has been researched [56][57][58]. However, there is a paucity of studies that guide practitioners on how to navigate the entire process of mining a multi-stage manufacturing process dataset. Existing studies do not explicate data exploration, pre-processing, dataset partitioning, model building, results parsing, cleaning and analysis. The methodology and the study reported in this article bridge this knowledge gap.
A manufacturing process can be defined as a sequence of interlinked operational activities in a manufacturing system, where each activity leads to the next, and the flow of activities forms a whole [59].
Manufacturing is generally fulfilled in stages. It was shown in this article that each stage in a manufacturing process can be used to demarcate the relevant process variables that should be progressed to build a learning model. This helps in pruning the dataset so that model performance is enhanced. Data modellers for multi-stage process learning would find this very useful where there are thousands of process variables and tens to hundreds of stages.
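Stage-based demarcation can be sketched as a simple column filter. The column naming convention ("Stage1_", "AmbientTemp", ...) is purely illustrative, not taken from the study's dataset.

```python
def stage_features(columns, stage):
    """Keep the shared ambient/material columns plus the given stage's own
    machine and measurement columns; columns belonging to other stages are
    pruned before model building. (A downstream stage's model could also
    admit upstream output columns, reflecting the process flow direction.)"""
    shared = [c for c in columns if not c.startswith("Stage")]
    own = [c for c in columns if c.startswith(f"Stage{stage}_")]
    return shared + own

cols = ["AmbientTemp", "MaterialSpec", "Stage1_MachineRPM",
        "Stage1_M1", "Stage2_MachineRPM", "Stage2_M1"]
subset = stage_features(cols, 1)
```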
Machine learning is an enabler for knowledge discovery. The methodology uses both feature engineering and classification modelling methods. Both methods on their own are able to extract complex patterns in large datasets that help provide a deeper understanding of the system that generated the dataset. The feature reduction component of the methodology was able to indicate those few attributes that can be used to explain each output measurement. This is a noise reduction strategy, as it reduces the feature space to only those features that are most relevant to explaining the target attribute. Decision makers can then use such information to understand the core relationships in their system. The classification prediction modelling process used to replace missing values is similar to the process one would follow when building a classifier prediction model for predictive process monitoring. The prediction models built to replace missing values can also be taken as the prediction models to use in subsequent prediction tasks. In other words, there is no need to build another model; rather, the task would be to improve training by increasing the number of training samples, which ultimately leads to an increased prediction accuracy of the model.
The methodology presented in this article represents noise correction of batched data. The built models can be injected into streaming data to enable real-time predictive process monitoring [60]. In addition, the process and model building steps can be collapsed into one algorithm to significantly reduce the time taken to manually correct the dataset. This advancement to the methodology is being undertaken by the author.
[11] list some likely drawbacks of replacing noisy values with values predicted using a classifier model. They argue that it carries a high computational cost. This is true, considering the number of models to build when there are many attributes to correct for erroneous values. A multi-target algorithm is a likely alternative worth testing and is an area for further research.
During the analysis, it was found that prediction accuracy drops in the vicinity of erroneous or missing values. This finding has implications for real-time predictive process monitoring. Managers should be aware that prediction may degrade where there are many erroneous values prior to the value being predicted. This is another area worthy of further research.
There is a paucity of the type of methodology presented in the current research; to the best of the author's knowledge, it has not been reported in the literature. The methodology is generalizable across manufacturing systems and can also be applied to non-manufacturing datasets. In this article, the methodology has been described with reproducibility in view. In addition, the software tools used are generic and open-source.

Figure 1. Process flow representing the multi-stage continuous-flow production system.
Figure 4. Flow process for correcting noise in multiple attributes in a dataset.
Figure 6. Prediction accuracy performance comparison of top-ranked features for predicting stage 1 output measurement 1.
Figure 8. Plot for stage 1 output measurement: a) prediction error for each instance; b) scatter diagram for predicted vs actual values.