Updated February 28, 2023

Introduction to HiveQL Queries

HiveQL has a SQL like a dialect used for summarizing and querying large chunks of data through Hadoop environment. Hive is widely used to alter, create and drop tables, databases, views or user-defined functions by big data professionals. Some of the data definition language (DDL) used to load data and modify it in the database are Create, Alter, Show, describe, describe formatted, drop, truncate.

Data Types in Hive

As in relational databases, Hive supports most of the primitive data types and also three collection data types.
Primitive data types are Integer, Boolean, float, double, String, Timestamp and Binary.
Within Integer, Hive supports varying sizes like tinyint, smallint, int and bigint.
The collection data types are structs, maps and array.
A struct is analogous to a C struct.
“Dot” notation is used to access the fields with this data type.
A map is a collection of key-value tuples.
Array notation(e.g. [‘key’ ]) is used to access the fields with this data type.
An array is a collection of elements with the same data type which can be accessed using zero-based integer indexes.

Types of HiveQL Queries

Given below are the types of HiveQL queries that are widely used:

1. HiveQL query for information_schema database

Hive queries can be written to get information about Hive privileges, tables, views or columns. Information_schema data is a read-only and user-friendly way to know the state of the system similar to sys database data.

Example:

Code:

Select * from information_schema.columns where table_schema = ‘database_name’

This will retrieve all the columns in the database table specified.

2. Creation and loading of data into a table

The bulk load operation is used to insert data into managed tables as Hive does not support row-level insert, delete or update.

Code:

LOAD DATA LOCAL INPATH ‘$Home/students_address’ OVERWRITE INTO TABLE students PARTITION (class = “12”, section = “science”);

With the above command, a directory is first created for the partition, and then all the files are copied in the directory. The keyword “local” is used to specify that the data is present in the local file system. “Partition” keyword can be omitted if the table does not have a partition key. Hive query will not check for the data being loaded to match the schema of the table.

The “INSERT” command is used to load data from a query into a table. “OVERWRITE” keyword is used to replace the data in a table. In Hive v0.8.0 or later, data will get appended into a table if overwrite keyword is omitted.

Code:

INSERT OVERWRITE TABLE students PARTITION ( class = “12”, section = “science”) Select * from students_data where class = “12” and section = “science”

All the partitions of the table students_data can be dynamically inserted by setting below properties:

Set hive.exec.dynamic.partition = True;

Set hive.exec.dynamic.partition.mode = unstrict

Set hive.exec.max.dynamic.partition.pernode = 1000;

CREATE TABLE clause will also create a table, and schema will be taken from the select clause.

3. Merge data in tables

Data can be merged from tables using classic SQL joins like inner, full outer, left, right join.

Code:

Select a.roll_number, class, section from students as a inner join pass_table as b on a.roll_number = b.roll_number

This will return class and section of all the roll numbers who have passed. Using a left join to this will return the “grade” for only pass students and “NULL” for the failed ones.

Code:

Select a.roll_number, class, section, b.grade from students as a Left join pass_table as b on a.roll_number = b.roll_number

UNION ALL and UNION are also used to append data present in two tables. However, few things need to be taken care of on doing so like, Schema of both the tables should be same. UNION is used to append the table and return unique records while UNION ALL returns all the records, including duplicates.

4. Ordering a table

ORDER BY clause enables total ordering of the data set by passing all data through one reducer. This may take a long time for large data tables, so SORT BY clause can be used to achieve partial sorting, by sorting each reducer.

Code:

Select customer_id, spends from customer as a order by spends DESC limit 100

This will return the top 100 customers with highest spends.

5. Aggregation of data in a table

Aggregation is done using aggregate functions that returns a single value after doing computation on many rows. These are count(col), sum(col), avg(col), min(col), max(col), stddev_pop(col), percentile_approx(int_expr, P, NB), where NB is number of histogram bins for estimation), collect_set(col), this returns duplicate elements after removing collection column.

The set property which helps in improving the performance of aggregation is hive.map.aggr = true.

“GROUP BY” clause is used with an aggregate function.

Example:

Code:

Select year(date_yy), avg(spends) from customer_spends where merchant = “Retail” group by year(date_yy)

HAVING clause is used to restrict the output from GROUP BY, which is done using a subquery.

6. Conditional statements

CASE…WHEN…THEN clause is similar to if-else statements to perform a conditional operation on any column in a query.

For example:

Code:

Select customer, Case when percentage <40 then “Fail” When percentage >=40 and percentage <80 then “Average” Else “Excellent” End as rank From students;

7. Filtering of data

WHERE clause is used to filter data in HiveQL. LIKE is used along with WHERE clause as a predicate operator to match a regular expression in a record.

8. Way to escape an illegal identifier

There is a way to use special characters or keywords or space in columns or partition names by enclosing it in backticks ( ` ).

Comments in Hive Scripts:

There is a way to add comment lines to the Hive script by starting it with the string ‘- -‘.

Example:

Below is the code to display students data.

Code:

Select * from student_table;

This only works in scripts, if pasted in CLI error messages will get displayed.

HiveQL supports a member of file formats like Textfile, Parquet, etc. The ORC (Optimized Row Columnar) format can support tables upto 300 PB as Hive supports ANSI SQL and ACID (Atomic, Consistent, Isolated and Durable) transactions.

Quiz Result
Total Questions	Correct Answers	Wrong Answers	Percentage

Introduction to HiveQL Queries

Data Types in Hive

Types of HiveQL Queries

1. HiveQL query for information_schema database

2. Creation and loading of data into a table

3. Merge data in tables

4. Ordering a table

5. Aggregation of data in a table

6. Conditional statements

7. Filtering of data

8. Way to escape an illegal identifier

Recommended Articles

Follow us!

APPS

Blog

Courses

Email