Exploring Hive/Impala partitions – continued

By | October 18, 2016

After discussing the use and benefits of Hive/Impala partitions, we will look deeper at how Hive implements partitioning at the low level.

First of all, we should determine where in HDFS hive metastore keeps its tables. This path is represented by hive parameter hive.metastore.warehouse.dir that can be changed in hive-sire.xml file or via Cloudera manager (under Hive service -> configuration):

View full size image

Once we have the path we can examine it in HDFS:

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - hive hive 0 2016-06-06 14:30 /user/hive/warehouse/_tmp.sampledata1
drwxrwxrwt - hive hive 0 2016-09-15 11:32 /user/hive/warehouse/sampledata
drwxrwxrwt - hive hive 0 2016-06-07 16:24 /user/hive/warehouse/sampledata1
drwxrwxrwt - hive hive 0 2016-09-15 11:40 /user/hive/warehouse/sampledata_p
drwxrwxrwt - hive hive 0 2016-09-29 14:41 /user/hive/warehouse/zip
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part

 

We are, of course, interested in the last two tables, zip and zip_part. First, let’s examine the unpartitioned table:

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/zip
Found 3 items
-rwxrwxrwt 3 hive hive 773604 2016-09-29 13:28 /user/hive/warehouse/zip/14zpallagi.csv
-rwxrwxrwt 3 hive hive 21660912 2016-09-29 14:39 /user/hive/warehouse/zip/data.csv
-rwxrwxrwt 3 hive hive 21660912 2016-09-29 14:41 /user/hive/warehouse/zip/data_copy_1.csv

 

We can see that the original data file was just copied into the zip directory as is. Let’s copy the file to local filesystem and take a look at its contents:

[[email protected] ~]$ hdfs dfs -copyToLocal /user/hive/warehouse/zip/data.csv /var/lib/hadoop-hdfs
[[email protected] ~]$ tail -10 /var/lib/hadoop-hdfs/data.csv
12,FL,34737,5,230,40,190,0
12,FL,34737,6,70,0,50,0
12,FL,34739,1,130,70,40,20
12,FL,34739,2,90,40,50,0
12,FL,34739,3,50,0,50,0
12,FL,34739,4,0,0,0,0
12,FL,34739,5,30,0,20,0
12,FL,34739,6,0,0,0,0
12,FL,34741,1,13790,7280,1670,4650
12,FL,34741,2,5770,2350,1580,1680

 

You can see that there are Eight columns in the file including the state code. Now we will try to do the same with the partitioned table:

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/zip_part
Found 11 items
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=AK
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=AL
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=AR
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=AZ
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=CA
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=CO
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=CT
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=DC
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=DE
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=FL
drwxrwxrwt - hive hive 0 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=STATE

 

Here you can see that under the basic table directory there are numerous sub directories, corresponding to the table’s partitions, along with a general partition “state=STATE” which is used for rows that do not fall into any other partition. Let’s see what’s inside of them:

 

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/zip_part/state=FL
Found 1 items
-rwxrwxrwt 3 hive hive 195396 2016-09-29 14:46 /user/hive/warehouse/zip_part/state=FL/000000_0

 

Now we will copy it to local filesystem and see it’s contents:

[[email protected] ~]$ hdfs dfs -copyToLocal /user/hive/warehouse/zip_part/state=FL/000000_0 /var/lib/hadoop-hdfs
[[email protected] ~]$ ls
000000_0
[[email protected] ~]$ tail -10 000000_0
12,34737,5,230,40,190,0
12,34737,6,70,0,50,0
12,34739,1,130,70,40,20
12,34739,2,90,40,50,0
12,34739,3,50,0,50,0
12,34739,4,0,0,0,0
12,34739,5,30,0,20,0
12,34739,6,0,0,0,0
12,34741,1,13790,7280,1670,4650
12,34741,2,5770,2350,1580,1680

 

If we compare the content of this file to the content of the unpatitioned file above we can see that the unpartitioned file has Eight columns including state. The partitioned table has only Seven columns and the state column is missing. This is because the state column is already contained in the directory name and thus hive can infer the “missing” column from the directory structure instead of having redundant data.

And what happen if we partition on more than one column ?

Let’s recreate the zip_part table, this time partitioned by two columns:

create table zip_part (
zipcode int,
agi_stub int,
n1 int,
mars1 int,
mars2 int,
mars4 int) partitioned by (statefips string, state string)
row format delimited fields terminated by ',' stored as textfile;

insert into zip_part partition(statefips, state) select zipcode,agi_stub,n1,mars1,mars2,mars4,statefips,state from zip;

Then, if we examine the underlying directory structure, we can see that it has just added another level of subdirectories, corresponding to the added partition key:

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/zip_part
Found 11 items
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=1
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=10
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=11
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=12
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=2
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=4
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=5
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=6
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=8
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=9
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=STATEFIPS
[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/zip_part/statefips=2
Found 1 items
drwxrwxrwt - hive hive 0 2016-10-13 16:28 /user/hive/warehouse/zip_part/statefips=2/state=AK

This concludes the Hive/Impala partitioning coverage.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.