Updated March 24, 2023

Introduction to Partitioning in Hive

Partitioning is a feature in Hive similar to RDBMS, making querying large datasets much faster and cost-effective. Partitioned tables are logical segments of large data tables based on one or more columns. This makes analyzing data much easier as only relevant subsets can be further investigated for deriving insights. This notion of partitioning is an old one, distributing the load horizontally and moving data closer to the user. Both external and managed (or internal) tables can be partitioned in Hive. Further, bucketing can be done using CLUSTERED by columns on these tables for improved query performance for certain queries.

Creating Data into Hive Tables

Let us create a table to manage “Wallet expenses”, which any digital wallet channel may have to track customers’ spend behaviour, having the following columns:

Month	String
Spender	String
Merchant	String
Mode	String
Amount	Float

To track monthly expenses, we want to create a partitioned table with columns month and spender.

CREATE TABLE expenses (Month String, Spender String, Merchant String, Mode String, Amount Float ) PARTITIONED BY (Month STRING, Spender STRING) Row format delimited fields terminated by ",";

We get to know the partition keys using the below commands.

Commands:

describe formatted expenses; show partitions expenses;

Partition keys behave like regular columns, once created, where users need not care whether it is a partitioned column or not unless optimization is required. Also, table schema need not have partition columns specified again as partitions create pseudo columns to query on.

Inserting Data into Hive Tables

Data insertion in HiveQL table can be done in two ways:

1. Static Partitioning

In static partitioning mode, we insert data individually into partitions. Each time data is loaded, the partition column value needs to be specified.

To insert value to the “expenses” table, using the below command in strict mode.

Command:

INSERT INTO TABLE expenses PARTITION (month= ‘201901’, spender = ‘PAY1001’) SELECT month, spender, merchant, mode, amount FROM expenses WHERE month=‘201901’ and spender = ‘PAY1001’;

Considering the table “expenses”, if there are 12 months and 100 spenders, then 12*100 = 1200 single insert statements will be written to insert all the table values.

There are certain types of query which are not allowed to run in MapReduce strict mode, i.e. when hive. mapred.mode = strict. These include:

Order by without limits
Cartesian product
Comparing bigints and doubles
Comparing bigints and strings
No partition key being picked up in a query.

2. Dynamic Partitioning

In dynamic partitioning mode, data is inserted automatically in partitions. It identifies the partition column values to be inserted. By default, Hive allows static partitioning, to prevent creating partitions for tables by accident. To set Hive to dynamic/unstrict mode, certain properties need to be explicitly defined.

Properties

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000; //sets the maximum number of dynamic partitions which a mapper or reducer can create, default value is 100.

After the dynamic properties are set as above, to insert value to the “expenses” table, below is the command.

Command:

INSERT INTO TABLE expenses PARTITION (month, spender) stored as sequencefile SELECT month, spender, merchant, mode, amount FROM expenses;

OVERWRITE command is used to overwrite the partition column values and replace them with new content. The whole table will be dropped on using overwrite if it is a non-partitioned table. INTO command will append to an existing table and not replace it from HIVE V0.8.0 and later.

Command:

INSERT OVERWRITE TABLE expenses PARTITION (month, spender) stored as sequence file SELECT month, spender, merchant, mode, amount FROM expenses;

Commands Used on Partitions in Hive

Below are some of the important commands used on partitions:

1. ALTER Partitions

There can be instances where the partitions created in a table need to be renamed or deleted or added ( same as an insert).

Command:

ALTER TABLE expenses PARTITION (spender = PAY1001) RENAME TO PARTITION( spender = PAYP1001)

We can verify this change by running the “SHOW PARTITIONS” command on the table. Partition column value changes; however, the metadata underlying it remains the same.

Command:

ALTER TABLE expenses DROP IF EXISTS PARTITION (month = 201902)

The partition gets deleted using this command. Both sub-directory and metadata are deleted in case of internal or managed tables.

2. EXCHANGE Partitions

If we have created partition in one table in expenses, it can be moved to another table customer with the same scheme does not have this partition present.

Command:

ALTER TABLE customer EXCHANGE PARTITION (spender) WITH TABLE expenses

3. TOUCH Partitions

The purpose of using this command is to read the metadata and write it back. It is widely used to log or fire hooks in case the table or partition is modified.

Command:

ALTER TABLE expenses TOUCH PARTITION (month, spender)

4. TRUNCATE Partitions

Truncate is used to remove a table or partition even from Trash, as this is similar to using PURGE.

Command:

TRUNCATE TABLE expenses PARTITION (month, spender)

5. PARTIAL Partitions

To change any existing partitions at once by using a single ALTER table statement, so that we don’t need to write multiple such statements, partial partitioning can be used.

We need to set hive.exec.dynamic.partition = true, to enable partial partitioning specifications.

Command:

ALTER TABLE expenses PARTITION (month, spender) CHANGE COLUMN amount amount DECIMAL(38,18)

Advantage and Limitation of Partitioning in Hive

Here are the advantage and limitation of Partitioning in hive explained below:

Advantages: Tables are stored in parts/segments making query response time faster as manipulation or search is required on a small segment rather than traversing the whole table.

Limitation: Too many partitions increase the overhead on name nodes as all metadata is stored in memory only. Each MapReduce job may end up having a huge volume of tasks (running in separate JVMs) due to a large number of partitions in the MapReduce execution engine.

Conclusion

We have got a fair idea of why partitioned tables will be more useful for large data sets with logical segments to be delved into. Widespread use case of partitions is analyzing time-series trends for customers, spending behaviour on specific Merchant categories, industry-wise profit trends, etc. Hive makes partitioning easy by abstracting the details for the users.

Quiz Result
Total Questions	Correct Answers	Wrong Answers	Percentage

Introduction to Partitioning in Hive

Creating Data into Hive Tables

Inserting Data into Hive Tables

1. Static Partitioning

2. Dynamic Partitioning

Commands Used on Partitions in Hive

1. ALTER Partitions

2. EXCHANGE Partitions

3. TOUCH Partitions

4. TRUNCATE Partitions

5. PARTIAL Partitions

Advantage and Limitation of Partitioning in Hive

Conclusion

Recommended Articles

Follow us!

APPS

Blog

Courses

Email