Updated March 24, 2023
Introduction to Partitioning in Hive
Partitioning is a feature in Hive similar to RDBMS, making querying large datasets much faster and cost-effective. Partitioned tables are logical segments of large data tables based on one or more columns. This makes analyzing data much easier as only relevant subsets can be further investigated for deriving insights. This notion of partitioning is an old one, distributing the load horizontally and moving data closer to the user. Both external and managed (or internal) tables can be partitioned in Hive. Further, bucketing can be done using CLUSTERED by columns on these tables for improved query performance for certain queries.
Creating Data into Hive Tables
Let us create a table to manage “Wallet expenses”, which any digital wallet channel may have to track customers’ spend behaviour, having the following columns:
Month | String |
Spender | String |
Merchant | String |
Mode | String |
Amount | Float |
To track monthly expenses, we want to create a partitioned table with columns month and spender.
CREATE TABLE expenses (Month String,
Spender String,
Merchant String,
Mode String,
Amount Float
)
PARTITIONED BY (Month STRING, Spender STRING)
Row format delimited fields terminated by ",";
We get to know the partition keys using the below commands.
Commands:
describe formatted expenses;
show partitions expenses;
Partition keys behave like regular columns, once created, where users need not care whether it is a partitioned column or not unless optimization is required. Also, table schema need not have partition columns specified again as partitions create pseudo columns to query on.
Inserting Data into Hive Tables
Data insertion in HiveQL table can be done in two ways:
1. Static Partitioning
In static partitioning mode, we insert data individually into partitions. Each time data is loaded, the partition column value needs to be specified.
To insert value to the “expenses” table, using the below command in strict mode.
Command:
INSERT INTO TABLE expenses PARTITION (month= ‘201901’, spender = ‘PAY1001’)
SELECT month, spender, merchant, mode, amount
FROM expenses WHERE month=‘201901’ and spender = ‘PAY1001’;
Considering the table “expenses”, if there are 12 months and 100 spenders, then 12*100 = 1200 single insert statements will be written to insert all the table values.
There are certain types of query which are not allowed to run in MapReduce strict mode, i.e. when hive. mapred.mode = strict. These include:
- Order by without limits
- Cartesian product
- Comparing bigints and doubles
- Comparing bigints and strings
- No partition key being picked up in a query.
2. Dynamic Partitioning
In dynamic partitioning mode, data is inserted automatically in partitions. It identifies the partition column values to be inserted. By default, Hive allows static partitioning, to prevent creating partitions for tables by accident. To set Hive to dynamic/unstrict mode, certain properties need to be explicitly defined.
Properties
- hive> set hive.exec.dynamic.partition=true;
- hive> set hive.exec.dynamic.partition.mode=nonstrict;
- hive> set hive.exec.max.dynamic.partitions.pernode=1000; //sets the maximum number of dynamic partitions which a mapper or reducer can create, default value is 100.
After the dynamic properties are set as above, to insert value to the “expenses” table, below is the command.
Command:
INSERT INTO TABLE expenses PARTITION (month, spender) stored as sequencefile
SELECT month, spender, merchant, mode, amount
FROM expenses;
OVERWRITE command is used to overwrite the partition column values and replace them with new content. The whole table will be dropped on using overwrite if it is a non-partitioned table. INTO command will append to an existing table and not replace it from HIVE V0.8.0 and later.
Command:
INSERT OVERWRITE TABLE expenses PARTITION (month, spender) stored as sequence file
SELECT month, spender, merchant, mode, amount
FROM expenses;
Commands Used on Partitions in Hive
Below are some of the important commands used on partitions:
1. ALTER Partitions
There can be instances where the partitions created in a table need to be renamed or deleted or added ( same as an insert).
Command:
ALTER TABLE expenses PARTITION (spender = PAY1001) RENAME TO PARTITION( spender = PAYP1001)
We can verify this change by running the “SHOW PARTITIONS” command on the table. Partition column value changes; however, the metadata underlying it remains the same.
Command:
ALTER TABLE expenses DROP IF EXISTS PARTITION (month = 201902)
The partition gets deleted using this command. Both sub-directory and metadata are deleted in case of internal or managed tables.
2. EXCHANGE Partitions
If we have created partition in one table in expenses, it can be moved to another table customer with the same scheme does not have this partition present.
Command:
ALTER TABLE customer EXCHANGE PARTITION (spender) WITH TABLE expenses
3. TOUCH Partitions
The purpose of using this command is to read the metadata and write it back. It is widely used to log or fire hooks in case the table or partition is modified.
Command:
ALTER TABLE expenses TOUCH PARTITION (month, spender)
4. TRUNCATE Partitions
Truncate is used to remove a table or partition even from Trash, as this is similar to using PURGE.
Command:
TRUNCATE TABLE expenses PARTITION (month, spender)
5. PARTIAL Partitions
To change any existing partitions at once by using a single ALTER table statement, so that we don’t need to write multiple such statements, partial partitioning can be used.
We need to set hive.exec.dynamic.partition = true, to enable partial partitioning specifications.
Command:
ALTER TABLE expenses PARTITION (month, spender) CHANGE COLUMN amount amount DECIMAL(38,18)
Advantage and Limitation of Partitioning in Hive
Here are the advantage and limitation of Partitioning in hive explained below:
Advantages: Tables are stored in parts/segments making query response time faster as manipulation or search is required on a small segment rather than traversing the whole table.
Limitation: Too many partitions increase the overhead on name nodes as all metadata is stored in memory only. Each MapReduce job may end up having a huge volume of tasks (running in separate JVMs) due to a large number of partitions in the MapReduce execution engine.
Conclusion
We have got a fair idea of why partitioned tables will be more useful for large data sets with logical segments to be delved into. Widespread use case of partitions is analyzing time-series trends for customers, spending behaviour on specific Merchant categories, industry-wise profit trends, etc. Hive makes partitioning easy by abstracting the details for the users.
Recommended Articles
This is a guide to Partitioning in Hive. Here we discuss creating, inserting and commands used for partitioning in Hive along with their advantages and limitations. You may also look at the following articles to learn more –