Table 5 Data processing discussion

From: An industrial big data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities

Requirement: Legacy integration

Discussion: The simulation shows JSON-encoded RAT and SPT measurements in the message queue. Legacy integration was evidently successful: although the RAT measurement originated from a legacy source, it now resides in the queue in the same format as the SPT measurement (a minimal message sketch illustrating the shared format follows this table).

Requirement: Cross-network communication

Discussion: The unified message queue shown in the simulation illustrates how messages from devices across different networks are consolidated after a successful ingestion process.

Requirement: Fault tolerance

Discussion: Cloud computing provides a fault-tolerant environment for data processing because resources can be scaled on demand. The data pipeline simulation depicts a high-throughput cloud infrastructure with data processing and message queue components, which are implied to reside in a highly distributed cloud service that provides fault tolerance across multiple compute nodes. Furthermore, the simulation shows how the message queue decouples data ingestion from data processing, adding a further layer of fault tolerance by shielding the ingestion process from faults that originate in the data processing components (see the decoupling sketch after this table).

Requirement: Extensibility

Discussion: The simulation illustrates the aggregation and contextualisation component that is responsible for preparing data for analysis; this is one example of the processing that may be required for time-series data. As new processing needs arise, additional components can subscribe to the message queue's subscription service and begin processing in parallel with existing components. This extension is facilitated by the decoupling of the processing components from the message queue through the subscription service (see the subscriber sketch after this table).

Requirement: Scalability

Discussion: The data processing simulation inherits its scalability from the underlying cloud infrastructure. Many of the relevant benefits of cloud computing have already been discussed with regard to fault tolerance; the same load-balancing features enable scalable data processing in the pipeline by scaling compute resources according to demand (i.e. the amount of processing required).

Requirement: Openness and accessibility

Discussion: The simulation shows data stored in a contextualised cloud repository once the aggregation process completes. The repository uses a naming convention that identifies a dataset (i.e. AHU1), an object (i.e. RAT), and a chronological association for accessing time-series data. To promote openness and accessibility, this data is accessed using standard HTTP requests (see the retrieval sketch after this table). Furthermore, the pipeline's ability to support additional data formats, standards and representations has already been addressed in the discussion of extensibility.
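To make the legacy-integration and cross-network points concrete, the following is a minimal sketch of how a RAT measurement from a legacy source and an SPT measurement from a native device could share one JSON structure on the message queue. The field names, units and source labels are illustrative assumptions; the exact message schema is not given in the table above.

```python
import json
from datetime import datetime, timezone


def to_queue_message(dataset, obj, value, unit, source):
    """Wrap a raw measurement in a common JSON structure for the message queue.

    Field names are assumptions for illustration only.
    """
    return json.dumps({
        "dataset": dataset,      # e.g. the air handling unit the point belongs to
        "object": obj,           # e.g. "RAT" or "SPT"
        "value": value,
        "unit": unit,
        "source": source,        # legacy gateway or native device
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


# A RAT reading ingested from a legacy source and an SPT reading from a modern
# device end up on the queue in exactly the same format.
rat_msg = to_queue_message("AHU1", "RAT", 21.4, "degC", "legacy-gateway")
spt_msg = to_queue_message("AHU1", "SPT", 21.0, "degC", "native-device")
print(rat_msg)
print(spt_msg)
```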
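The fault-tolerance discussion notes that the message queue decouples ingestion from processing. The decoupling sketch below illustrates that idea using Python's standard-library queue and a worker thread as stand-ins for the cloud broker and the processing component; it is an in-process analogy under those assumptions, not the pipeline's actual infrastructure. The point is that a fault on the consumer side is contained and never blocks the producer.

```python
import queue
import threading

message_queue = queue.Queue()  # stands in for the cloud message broker


def ingest(readings):
    """Ingestion only publishes to the queue; it never waits on processing."""
    for reading in readings:
        message_queue.put(reading)


def process():
    """Processing consumes at its own pace; a fault here is contained."""
    while True:
        message = message_queue.get()
        try:
            # aggregation / contextualisation would happen here; any exception
            # raised while handling this message stays on the consumer side
            print("processing", message["object"])
        except Exception:
            pass
        finally:
            message_queue.task_done()


threading.Thread(target=process, daemon=True).start()
ingest([{"object": "RAT", "value": 21.4}, {"object": "SPT", "value": 21.0}])
message_queue.join()  # the ingestion side is unaffected by consumer behaviour
```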
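For extensibility, the subscriber sketch below shows the pattern described in the table: a new processing need is met by registering another subscriber with a fan-out broker, without touching the ingestion side or the existing aggregation component. The SubscriptionService class and handler signatures are toy assumptions, not the pipeline's actual API.

```python
from typing import Callable, Dict, List


class SubscriptionService:
    """Toy fan-out broker: every subscriber receives every published message."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Dict], None]] = []

    def subscribe(self, handler: Callable[[Dict], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, message: Dict) -> None:
        for handler in self._subscribers:
            handler(message)


broker = SubscriptionService()

# Existing component: aggregation and contextualisation.
broker.subscribe(lambda m: print("aggregating", m["object"]))

# A new analytics requirement is met by registering another subscriber;
# the ingestion side and the existing component are untouched.
broker.subscribe(lambda m: print("anomaly check", m["object"]))

broker.publish({"dataset": "AHU1", "object": "RAT", "value": 21.4})
```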
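For openness and accessibility, the retrieval sketch below shows how the dataset/object naming convention (e.g. AHU1/RAT) and a time range could map onto a standard HTTP GET. The base URL, path layout and query parameters are hypothetical; only the naming convention itself comes from the discussion above.

```python
import requests  # generic HTTP client, used here purely for illustration

BASE_URL = "https://pipeline.example.com/api"  # hypothetical endpoint


def fetch_timeseries(dataset, obj, start, end):
    """Retrieve contextualised time-series data with a plain HTTP GET.

    The path follows the dataset/object naming convention (e.g. AHU1/RAT);
    the query parameters express the chronological association.
    """
    url = f"{BASE_URL}/{dataset}/{obj}"  # e.g. .../AHU1/RAT
    response = requests.get(url, params={"start": start, "end": end}, timeout=30)
    response.raise_for_status()
    return response.json()


rat_history = fetch_timeseries("AHU1", "RAT",
                               "2023-01-01T00:00:00Z", "2023-01-02T00:00:00Z")
```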