From: Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
Data organization strategy | Data model | Attributes | Decrease in processing time | Decrease in CPU usage | Role of the attributes |
---|---|---|---|---|---|
Multiple partitioning | SS-P, DT-P | “Od_Year”, “S_Region” | Yes | Yes | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Multiple partitioning | SS-P | “S_Region”, “S_Nation”, “S_City” | Yes | NA | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Bucketing | SS-B, DT-B | “Orderkey” | No | No | Attribute not used in the “where” conditions nor used for “group by” or “order by” |
Bucketing | DT-B | “Od_Year”, “P_Brand” | Yes | Yes | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Bucketing | SS-B | “Suppkey” | Yes (Hive); No (Presto) | No (SF = 100); Yes (Hive, SF = 300) | Attribute not used in the “where” conditions nor used for “group by” or “order by”. Attribute used for joining tables |
Bucketing | SS-B | “Orderdate”, “Custkey”, “Suppkey”, “Partkey” | No | No | Attributes not used in the “where” conditions nor used for “group by” or “order by”. Attributes used for joining tables |
Partitioning and bucketing | SS-PB | “Od_Year”, “Orderkey” | No | NA | Only “Od_Year” is used in the “where” conditions, and in the “group by” and “order by” clauses |
Partitioning and bucketing | SS-PB | “S_Region”, “Suppkey” | Yes | NA | Only “S_Region” is used in the “where” conditions. “Suppkey” is used for joining tables |
Partitioning and bucketing | SS-PB, DT-PB | “Od_Year”, “S_Region”, “Suppkey” | Yes | Yes | “Od_Year” and “S_Region” are used in the “where” conditions, and “Od_Year” is also used in the “group by” and “order by” clauses. “Suppkey” is used for joining tables in the SS-PB scenario |
Partitioning and bucketing | DT-PB | “Od_Year”, “P_Brand” | Yes | NA | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
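
To make the combined strategy in the table concrete, the following is a minimal sketch of how a table partitioned by “Od_Year” and “S_Region” and bucketed by “Suppkey” might be declared in Hive. It is an illustration only, not the DDL used in the evaluation: the helper `hive_create_table`, the table name `lineorder_ss_pb`, the column types, the extra `revenue` column and the bucket count of 32 are all assumptions.

```python
from typing import Mapping, Sequence


def hive_create_table(
    table: str,
    columns: Mapping[str, str],
    partition_cols: Mapping[str, str],
    bucket_cols: Sequence[str],
    num_buckets: int,
) -> str:
    """Render a Hive CREATE TABLE statement that partitions and buckets a table.

    Hypothetical helper for illustration; it only builds the DDL string.
    """
    # Hive declares partition columns only in PARTITIONED BY, never in the
    # regular column list.
    overlap = sorted(set(columns) & set(partition_cols))
    if overlap:
        raise ValueError(f"partition columns repeated as regular columns: {overlap}")

    # Bucketing (CLUSTERED BY) columns must be regular columns of the table.
    missing = [c for c in bucket_cols if c not in columns]
    if missing:
        raise ValueError(f"bucket columns not found in column list: {missing}")

    col_sql = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    part_sql = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols.items())
    bucket_sql = ", ".join(bucket_cols)

    return (
        f"CREATE TABLE {table} ({col_sql}) "
        f"PARTITIONED BY ({part_sql}) "
        f"CLUSTERED BY ({bucket_sql}) INTO {num_buckets} BUCKETS"
    )


# Assumed schema: names follow the attributes in the table, types are guesses.
ddl = hive_create_table(
    table="lineorder_ss_pb",
    columns={"orderkey": "BIGINT", "suppkey": "INT", "revenue": "DOUBLE"},
    partition_cols={"od_year": "INT", "s_region": "STRING"},
    bucket_cols=["suppkey"],
    num_buckets=32,
)
print(ddl)
```

In this sketch the filter attributes go into `PARTITIONED BY` and the join attribute into `CLUSTERED BY`, which mirrors the role split reported for the “Od_Year”, “S_Region”, “Suppkey” row.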