From: Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
Data organization strategy | Data model | Attributes | Decrease in processing time | Decrease in CPU usage | Role of the attributes |
---|---|---|---|---|---|
Multiple partitioning | SS-P, DT-P | “Od_Year”, “S_Region” | Yes | Yes | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Multiple partitioning | SS-P | “S_Region”, “S_Nation”, “S_City” | Yes | NA | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Bucketing | SS-B, DT-B | “Orderkey” | No | No | Attribute not used in the “where” conditions nor used for “group by” or “order by” |
Bucketing | DT-B | “Od_Year”, “P_Brand” | Yes | Yes | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
Bucketing | SS-B | “Suppkey” | Yes (Hive); No (Presto) | No (SF = 100); Yes (Hive, SF = 300) | Attribute not used in the “where” conditions nor used for “group by” or “order by”. Attribute used for joining tables |
Bucketing | SS-B | “Orderdate”, “Custkey”, “Suppkey”, “Partkey” | No | No | Attributes not used in the “where” conditions nor used for “group by” or “order by”. Attributes used for joining tables |
Partitioning and bucketing | SS-PB | “Od_Year”, “Orderkey” | No | NA | Only “Od_Year” is used in the “where” conditions, and in the “group by” and “order by” clauses |
Partitioning and bucketing | SS-PB | “S_Region”, “Suppkey” | Yes | NA | Only “S_Region” is used in the “where” conditions. “Suppkey” is used for joining tables |
Partitioning and bucketing | SS-PB, DT-PB | “Od_Year”, “S_Region”, “Suppkey” | Yes | Yes | “Od_Year” and “S_Region” are used in the “where” conditions, and “Od_Year” is also used in the “group by” and “order by” clauses. “Suppkey” is used for joining tables in the SS-PB scenario |
Partitioning and bucketing | DT-PB | “Od_Year”, “P_Brand” | Yes | NA | Attributes are used as filters in the “where” conditions, and in the “group by” and “order by” clauses |
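
The pattern in the "Role of the attributes" column can be illustrated with a toy model. The sketch below is not Hive and measures neither processing time nor CPU usage; it only counts how many rows a query would have to read under two layouts. It assumes made-up data, a hypothetical query that filters on “Od_Year” and “S_Region”, and a plain modulo standing in for the engine's bucket hash function. All function and variable names are illustrative.

```python
from collections import defaultdict

REGIONS = ["ASIA", "EUROPE", "AMERICA", "AFRICA"]

# Toy fact rows: (od_year, s_region, orderkey). Values are invented for illustration.
ROWS = [(1992 + i % 7, REGIONS[i % 4], i) for i in range(2800)]


def partition_by(rows, key_indexes):
    """Group rows into partitions keyed by the chosen attribute values."""
    parts = defaultdict(list)
    for row in rows:
        parts[tuple(row[i] for i in key_indexes)].append(row)
    return parts


def bucket_by(rows, key_index, n_buckets):
    """Spread rows over a fixed number of buckets (modulo as a stand-in hash)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index] % n_buckets].append(row)
    return buckets


def rows_scanned(layout, keep_unit=None):
    """Count rows read; storage units rejected by keep_unit are skipped entirely."""
    return sum(
        len(rows)
        for key, rows in layout.items()
        if keep_unit is None or keep_unit(key)
    )


# Hypothetical query: WHERE od_year = 1997 AND s_region = 'ASIA'
partitioned = partition_by(ROWS, key_indexes=(0, 1))  # by (od_year, s_region)
bucketed = bucket_by(ROWS, key_index=2, n_buckets=28)  # by orderkey

# The filter attributes are the partition keys, so only one partition is read.
partitioned_scan = rows_scanned(partitioned, lambda key: key == (1997, "ASIA"))

# The filter does not mention orderkey, so no bucket can be skipped.
bucketed_scan = rows_scanned(bucketed)

print(partitioned_scan, bucketed_scan)  # 100 2800
```

In this model, the layout keyed on the filter attributes reads 100 of the 2800 rows, while bucketing on “Orderkey” reads all of them. This is consistent with the "Yes" and "No" entries for those rows of the table, although the real measurements also depend on joins, file sizes, and the query engine.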