From: An efficient strategy for the collection and storage of large volumes of data for computation
Approach | Advantage | Disadvantage |
---|---|---|
Approach 1: data transformation occurs within the data pipeline | Well-tested approach: the typical scenario in most data analytics platforms | Complex: the transformation logic lives in the data pipeline, so replacing the pipeline means re-implementing that logic. Lost data authenticity: the pipeline transforms the data, so the raw data is lost |
Approach 2: data transformation occurs within the storage layer | Easy to migrate or replace: the transformation logic is moved to a centralised location, so the data pipeline is easier to migrate or replace. Raw data is intact: meets regulatory standards for storing the data both before and after transformation | Complex: an intermediate job is required for the transformation. Large storage needed: both raw and transformed data are stored |
Approach 3: data transformation occurs within the analytics jobs | Clean and simple: no complexity is added to the data pipeline. Less storage needed: only raw data is stored. Easy to migrate or replace: the transformation logic is moved to a centralised location | Increased execution overhead: the analytics job must transform the data itself. Repetition: the transformation takes place every time an analytics job is executed |
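
The trade-offs in the table can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of the source: the `normalise` transformation, the dictionary standing in for the storage layer, and the function names are all hypothetical. It only shows where the transformation runs in each approach and what ends up stored.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Transform = Callable[[Record], Record]


def normalise(record: Record) -> Record:
    """Hypothetical transformation: trim and lower-case the 'name' field."""
    return {**record, "name": record["name"].strip().lower()}


def approach_1(incoming: List[Record], transform: Transform) -> Dict[str, List[Record]]:
    """Approach 1: the pipeline transforms before storing, so raw data is not kept."""
    return {"transformed": [transform(r) for r in incoming]}


def approach_2(incoming: List[Record], transform: Transform) -> Dict[str, List[Record]]:
    """Approach 2: the pipeline stores raw data; an intermediate job in the
    storage layer then writes a transformed copy alongside it."""
    storage = {"raw": list(incoming)}
    storage["transformed"] = [transform(r) for r in storage["raw"]]
    return storage


def approach_3(incoming: List[Record]) -> Dict[str, List[Record]]:
    """Approach 3: only raw data is stored; nothing is transformed at ingest."""
    return {"raw": list(incoming)}


def analytics_job(storage: Dict[str, List[Record]], transform: Transform) -> List[str]:
    """Approach 3 analytics job: transforms the raw data on every run."""
    return sorted(transform(r)["name"] for r in storage["raw"])


if __name__ == "__main__":
    incoming = [{"name": " Alice "}, {"name": "BOB"}]
    print(approach_1(incoming, normalise))   # transformed records only
    print(approach_2(incoming, normalise))   # raw and transformed records
    print(analytics_job(approach_3(incoming), normalise))  # transformed at query time
```

In this sketch, Approach 2 holds twice as many records as the other two, while Approach 3 calls `normalise` again on every `analytics_job` run, which mirrors the storage and repetition costs listed in the table.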