An efficient strategy for the collection and storage of large volumes of data for computation

Journal of Big Data

Table 1 Summary of advantages and disadvantages of the proposed approaches

Approach	Advantages	Disadvantages
Approach 1: Data transformation occurs within the data pipeline	Well-tested approach: the typical scenario in most data analytics platforms	Complex: the transformation logic is kept in the data pipeline, so if the pipeline is replaced the transformation logic must be re-implemented; Lost data authenticity: the data is transformed by the data pipeline, so the raw data is lost
Approach 2: Data transformation occurs within the storage layer	Easy to migrate or replace: the transformation logic is moved to a centralised location, so it is easier to migrate or replace the data pipeline; Raw data is intact: meets regulatory requirements to retain the data both before and after transformation	Complex: an intermediate job is required for transformation; Large storage needed: both raw and transformed data are stored
Approach 3: Data transformation occurs within the analytics jobs	Clean and simple: no complexity is added to the data pipeline; Less storage needed: only raw data is stored; Easy to migrate or replace: the transformation logic is moved to a centralised location	Increased execution overhead: the analytics job must transform the data; Repetition: the transformation takes place every time an analytics job is executed
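
The difference between the three approaches in Table 1 is where the transformation step runs. The following is a minimal sketch of that placement only, not the implementation used in the paper. All names (`transform`, `approach_1`, `intermediate_job`, `analytics_job`, and the plain Python lists standing in for storage) are hypothetical, and the lower-casing transformation is an arbitrary example.

```python
# Illustrative sketch only: every name here is an assumption made for this
# example, and plain lists stand in for the storage layer.

def transform(record):
    """Example transformation: normalise field names to lower case."""
    return {key.lower(): value for key, value in record.items()}


def approach_1(incoming, storage):
    """Approach 1: transform inside the data pipeline.

    Only the transformed records reach storage, so the raw data is lost.
    """
    for record in incoming:
        storage.append(transform(record))


def intermediate_job(raw_storage, transformed_storage):
    """Intermediate job for Approach 2: rebuild the transformed copy."""
    transformed_storage[:] = [transform(record) for record in raw_storage]


def approach_2(incoming, raw_storage, transformed_storage):
    """Approach 2: store raw data, then transform within the storage layer.

    Both raw and transformed records are kept, which needs more storage.
    """
    raw_storage.extend(incoming)
    intermediate_job(raw_storage, transformed_storage)


def approach_3(incoming, raw_storage):
    """Approach 3: the pipeline stores raw data only and does nothing else."""
    raw_storage.extend(incoming)


def analytics_job(raw_storage):
    """Analytics job for Approach 3: transforms the raw data on every run."""
    return sum(transform(record)["value"] for record in raw_storage)
```

In this sketch, `approach_1` never keeps a raw record, `approach_2` keeps two copies of the data, and `analytics_job` repeats the transformation each time it is called, which mirrors the trade-offs listed in the table.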