Continued from the previous posts, this post will look into more details of partitioned tables. Note that Hive does not support row-level inserts, updates, and deletes. However, Hive adds extensions such as table partitioning for better performance in the context of Hadoop.
Partitioned Managed Tables
In general, partitioning data means distributing data load horizontally, moving data physically closer to its most frequent users. In Hive, partitioning tables changes how Hive structures its data storage for some performance gain.
In “Programming Hive”, the authors present a hypothetical problem where one will regularly query some
employees table by country and state, e.g., all employees in California, US or Alberta, Canada. Therefore, partitioning this table by country and state is a logical thing to do.
1 2 3 4 5 6 7 8
PARTITIONED BY clause, Hive will store data for these tables in a subdirectory
employees under the directory defined by
hive.metastore.warehouse.dir (see Managed tables). However, Hive will now create subdirectories inside
employees directory for the above partitioning structure:
1 2 3 4 5 6 7
The actual directory names depends on values of partition keys (e.g., country and state). For very large data sets, partitioning can improve query performance, but only if the partitioning scheme reflects common range filtering (e.g., by countries or states). When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters (e.g.,
WHERE state = 'CA').
You can view the partitions in a table with
SHOW PARTITIONS, as shown in examples below:
1 2 3 4
Given a partitioned table, a query across all partitions can result in a enormous MapReduce job, especially for a huge data set. It is probably desirable to put in place a safety measure which prohibits queries without any filter on partitions. Hive has a “strict” mode for that.
1 2 3 4 5
Partitioned External Tables
You can use partitioning with external tables. The combination gives you a way to “share” data with other tools, while still optimizing query performance. While LOCATION clause is required for non-partitioned external table to specify data location, it is not required for external partitioned tables. Instead,
ALTER TABLE statement is used to add data in each partition separately.
1 2 3 4 5 6 7 8 9 10 11 12 13
ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with managed tables, too. However, it is not recommended since you have to manually keep track of this partition and remember to delete data in case you want to completely drop the managed table.
Example use case of partitioned external tables
For example, each day we might use the following procedure to move data older than a month to S3:
1) Copy the data for the partition being moved to S3. For example, you can use the hadoop distcp command:
2) Alter the table to point the partition to the S3 location:
3) Remove the HDFS copy of the partition using the hadoop fs -rmr command:
Altering Partitioned Tables
ALTER TABLE statements for manipulating table partitions are shown in the following examples:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
ALTER TABLE ... ARCHIVE PARTITION statement captures the partition files into a Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any space savings. To reverse the operation, substitute UNARCHIVE for ARCHIVE. This feature is only available for individual partitions of partitioned tables.