Personal Programming Notes

To err is human; to debug, divine.

Learning Hive (Pt. 8): Data Manipulation in HiveQL

This post covers the parts of HiveQL used to put data into tables and to extract data from tables to the filesystem.

LOAD

Hive has no row-level insert, update, or delete operations. Instead, we use one of its "bulk" load operations to put data into a table.

Hive LOAD
> LOAD DATA LOCAL INPATH '/home/ca-employees'
OVERWRITE INTO TABLE employees
PARTITION ( country = 'US', state = 'CA' ); -- remove this if the target is not partitioned

This LOAD command will first create the directory for the partition, if necessary, and then copy the data into it. If the LOCAL keyword is specified, the path is assumed to be on the local filesystem; otherwise it is assumed to be in the distributed filesystem (typically HDFS). If you use the OVERWRITE keyword, any data already present in the target directory is removed; without it, the new files are simply added to the target directory. More details can be found here.
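
For contrast, here is a sketch of a load from a path in HDFS (no LOCAL keyword) and without OVERWRITE; the path '/data/or-employees' is made up for illustration. Note that without LOCAL, Hive moves the files from the source path into the table's directory rather than copying them.

> LOAD DATA INPATH '/data/or-employees'
INTO TABLE employees
PARTITION ( country = 'US', state = 'OR' ); -- no OVERWRITE: files are added alongside any existing data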

INSERT

The INSERT statement will load data into a table, as shown below:

Hive INSERT
> INSERT OVERWRITE TABLE employees
PARTITION ( country = 'US', state = 'OR' )
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

If the keyword OVERWRITE is replaced with INTO, Hive appends the data instead of replacing it.
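
For example, the same statement in append mode would look like this (a sketch using the same tables as above):

> INSERT INTO TABLE employees
PARTITION ( country = 'US', state = 'OR' )
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';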

Similar to other SQL dialects, you can also create a table and populate it with the results of a SELECT query in a single statement:

> CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE state = 'CA';

Exporting data

If you want the data files as is, then exporting data can be as simple as copying the directories or files:

hadoop fs -cp source_path target_path

Otherwise, you can use the INSERT ... DIRECTORY ... statement with an associated SELECT statement to specify the data you want, as in this example:

Export from Hive
> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';
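
One or more files will be written to /tmp/ca_employees, with columns separated by Hive's default field delimiter (^A) by default. In Hive 0.11 and later you can also, as far as I know, specify the row format of the exported files; here is a sketch that writes comma-separated fields instead:

> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT name, salary, address
FROM employees
WHERE state = 'CA';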