Updated February 28, 2023
Introduction to HiveQL Queries
HiveQL has a SQL like a dialect used for summarizing and querying large chunks of data through Hadoop environment. Hive is widely used to alter, create and drop tables, databases, views or user-defined functions by big data professionals. Some of the data definition language (DDL) used to load data and modify it in the database are Create, Alter, Show, describe, describe formatted, drop, truncate.
Data Types in Hive
- As in relational databases, Hive supports most of the primitive data types and also three collection data types.
- Primitive data types are Integer, Boolean, float, double, String, Timestamp and Binary.
- Within Integer, Hive supports varying sizes like tinyint, smallint, int and bigint.
- The collection data types are structs, maps and array.
- A struct is analogous to a C struct.
- “Dot” notation is used to access the fields with this data type.
- A map is a collection of key-value tuples.
- Array notation(e.g. [‘key’ ]) is used to access the fields with this data type.
- An array is a collection of elements with the same data type which can be accessed using zero-based integer indexes.
Types of HiveQL Queries
Given below are the types of HiveQL queries that are widely used:
1. HiveQL query for information_schema database
Hive queries can be written to get information about Hive privileges, tables, views or columns. Information_schema data is a read-only and user-friendly way to know the state of the system similar to sys database data.
Example:
Code:
Select * from information_schema.columns where table_schema = ‘database_name’
This will retrieve all the columns in the database table specified.
2. Creation and loading of data into a table
The bulk load operation is used to insert data into managed tables as Hive does not support row-level insert, delete or update.
Code:
LOAD DATA LOCAL INPATH ‘$Home/students_address’ OVERWRITE INTO TABLE students
PARTITION (class = “12”, section = “science”);
With the above command, a directory is first created for the partition, and then all the files are copied in the directory. The keyword “local” is used to specify that the data is present in the local file system. “Partition” keyword can be omitted if the table does not have a partition key. Hive query will not check for the data being loaded to match the schema of the table.
The “INSERT” command is used to load data from a query into a table. “OVERWRITE” keyword is used to replace the data in a table. In Hive v0.8.0 or later, data will get appended into a table if overwrite keyword is omitted.
Code:
INSERT OVERWRITE TABLE students
PARTITION ( class = “12”, section = “science”)
Select * from students_data where class = “12” and section = “science”
All the partitions of the table students_data can be dynamically inserted by setting below properties:
Set hive.exec.dynamic.partition = True;
Set hive.exec.dynamic.partition.mode = unstrict
Set hive.exec.max.dynamic.partition.pernode = 1000;
CREATE TABLE clause will also create a table, and schema will be taken from the select clause.
3. Merge data in tables
Data can be merged from tables using classic SQL joins like inner, full outer, left, right join.
Code:
Select a.roll_number, class, section from students as a
inner join pass_table as b
on a.roll_number = b.roll_number
This will return class and section of all the roll numbers who have passed. Using a left join to this will return the “grade” for only pass students and “NULL” for the failed ones.
Code:
Select a.roll_number, class, section, b.grade from students as a
Left join pass_table as b
on a.roll_number = b.roll_number
UNION ALL and UNION are also used to append data present in two tables. However, few things need to be taken care of on doing so like, Schema of both the tables should be same. UNION is used to append the table and return unique records while UNION ALL returns all the records, including duplicates.
4. Ordering a table
ORDER BY clause enables total ordering of the data set by passing all data through one reducer. This may take a long time for large data tables, so SORT BY clause can be used to achieve partial sorting, by sorting each reducer.
Code:
Select customer_id, spends from customer as a order by spends DESC limit 100
This will return the top 100 customers with highest spends.
5. Aggregation of data in a table
Aggregation is done using aggregate functions that returns a single value after doing computation on many rows. These are count(col), sum(col), avg(col), min(col), max(col), stddev_pop(col), percentile_approx(int_expr, P, NB), where NB is number of histogram bins for estimation), collect_set(col), this returns duplicate elements after removing collection column.
The set property which helps in improving the performance of aggregation is hive.map.aggr = true.
“GROUP BY” clause is used with an aggregate function.
Example:
Code:
Select year(date_yy), avg(spends) from customer_spends where merchant = “Retail” group by year(date_yy)
HAVING clause is used to restrict the output from GROUP BY, which is done using a subquery.
6. Conditional statements
CASE…WHEN…THEN clause is similar to if-else statements to perform a conditional operation on any column in a query.
For example:
Code:
Select customer,
Case when percentage <40 then “Fail”
When percentage >=40 and percentage <80 then “Average” Else “Excellent”
End as rank From students;
7. Filtering of data
WHERE clause is used to filter data in HiveQL. LIKE is used along with WHERE clause as a predicate operator to match a regular expression in a record.
8. Way to escape an illegal identifier
There is a way to use special characters or keywords or space in columns or partition names by enclosing it in backticks ( ` ).
Comments in Hive Scripts:
There is a way to add comment lines to the Hive script by starting it with the string ‘- -‘.
Example:
Below is the code to display students data.
Code:
Select * from student_table;
This only works in scripts, if pasted in CLI error messages will get displayed.
HiveQL supports a member of file formats like Textfile, Parquet, etc. The ORC (Optimized Row Columnar) format can support tables upto 300 PB as Hive supports ANSI SQL and ACID (Atomic, Consistent, Isolated and Durable) transactions.
Recommended Articles
This is a guide to HiveQL Queries. Here we discuss the introduction to HiveQL Queries along with data types and types of HiveQL queries, respectively. You may also have a look at the following articles to learn more –