Source data often arrives in Hadoop as compressed text files, and we frequently run SQL queries directly on top of them without any transformations.
Sometimes these queries are simple single-table searches that return a few rows matching the specified predicates, and people often complain about their performance.
Compressed Text Files
Consider the following sample table:
CREATE TABLE clicks
(
  id STRING,
  name STRING,
  ...
  referral_id STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains a single file data.txt.gz that has 27.3M rows and a relatively small size of 5.3 GB.
I want to interactively query this table using SQL, for example, to find clicks by a referral ID, and expect just a dozen rows as the result:
SET hive.fetch.task.conversion = more;
SET hive.fetch.task.conversion.threshold = 10737418240;  -- 10 GB

SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
Since there is only one .gz file, which is non-splittable (only one mapper can process it), I enabled direct fetch so Hive reads the file without launching any MapReduce or Tez jobs.
This is a text file, so there is no way to push the predicate down, and Hive has to read and decompress the entire file. It took 6 minutes 33 seconds to execute this query.
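The reason can be illustrated with a short Python sketch (a simplified analogue of what the reader has to do, not Hive's actual code path): gzip is a stream format with no block index, so every byte must be decompressed and every row examined even when only a handful of rows match.

```python
import gzip
import io

def filter_gzip_rows(data_gz: bytes, predicate) -> list:
    """Scan a gzip-compressed text file line by line.

    gzip has no block index, so the whole stream must be
    decompressed even if only a few rows match -- this is
    why a single-row lookup still reads all 5.3 GB.
    """
    matches = []
    with gzip.open(io.BytesIO(data_gz), "rt") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if predicate(line):
                matches.append(line)
    return matches

# Build a tiny compressed "table" to demonstrate (hypothetical data).
rows = "\n".join(f"id{i}\tname{i}\tref{i % 5}" for i in range(100))
compressed = gzip.compress(rows.encode())

matches = filter_gzip_rows(compressed, lambda l: l.split("\t")[2] == "ref3")
print(len(matches), matches[0])  # 20 matching rows out of 100
```

Even though the predicate selects only 20 of 100 rows, the full stream is decompressed; for the 5.3 GB file above the same logic applies at scale.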
I can see the following records in the Hive log:
2018-10-01 19:44:14 Compiling command SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
2018-10-01 19:44:19 Semantic Analysis Completed
2018-10-01 19:44:19 Executing command
2018-10-01 19:44:20 s3n.S3NativeFileSystem Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/data.txt.gz' for reading
2018-10-01 19:50:52 CliDriver - Fetched: 26 row(s)
Looking at the S3 access log I can see a single GET OBJECT request:
| S3 Operation | Key | Bytes Sent | Total Time (ms) | Turnaround (ms) |
|---|---|---|---|---|
| REST.GET.OBJECT | data.txt.gz | 5,371,182,086 | 390,601 | 140 |
Note. Total Time in the S3 log shows how much time the client spent reading and processing the data file (in my case, decompressing the data and iterating the table rows), so you cannot use Total Time alone to evaluate S3 performance. For example, I can run the following code to just read the same data file, without decompressing it or iterating rows:
#!/usr/bin/python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
config = TransferConfig(use_threads = False, multipart_threshold = 10000000000)

s3.download_file("cloudsqale", "hive/dmtolpeko.db/clicks/data.txt.gz", "/dev/null",
  Config = config)
I get the following metrics in S3 access log:
| S3 Operation | Key | Bytes Sent | Total Time (ms) | Turnaround (ms) |
|---|---|---|---|---|
| REST.GET.OBJECT | data.txt.gz | 5,371,182,086 | 53,790 | 112 |
So working with compressed text files in Hive brings significant overhead:
| Operation | Time | Throughput |
|---|---|---|
| Single Thread Read, Decompress and Iterate (5.3 GB file) | 6 min 33 sec | 13.8 MB/sec |
| Single Thread Read (5.3 GB file) | 54 sec | 101.3 MB/sec |
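The throughput figures follow directly from the S3 log: Bytes Sent divided by Total Time. A quick check (my arithmetic on the log values above, assuming decimal megabytes; small differences from the table come from rounding conventions):

```python
def throughput_mb_s(bytes_sent: int, total_time_ms: int) -> float:
    """Throughput from S3 access log fields: bytes sent over total request time."""
    return bytes_sent / (total_time_ms / 1000) / 1_000_000

BYTES = 5_371_182_086  # data.txt.gz size from the S3 access log

# Hive: read + decompress + iterate took 390,601 ms (~6.5 min)
print(round(throughput_mb_s(BYTES, 390_601), 1))  # ~13.8 MB/sec

# Plain download took 53,790 ms (~54 sec) -- roughly 100 MB/sec,
# close to the ~101 MB/sec figure above
print(round(throughput_mb_s(BYTES, 53_790), 1))
```

So roughly 85% of the query time is spent decompressing and iterating rows on a single thread, not reading from S3.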
ORC Files
Now let’s move the data into a table stored as ORC files with the following properties:
CREATE TABLE clicks
(
  id STRING,
  name STRING,
  ...
  referral_id STRING
)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/'
TBLPROPERTIES
(
  "orc.stripe.size" = "268435456",
  "orc.row.index.stride" = "100000",
  "orc.bloom.filter.columns" = "referral_id"
);
I loaded all data into a single ORC file, but its size is now about 6.5 GB, compared with 5.3 GB for the compressed text file holding the same data.
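The orc.bloom.filter.columns property makes the ORC writer store a Bloom filter for referral_id in each row-index stride, so a reader can skip strides that definitely do not contain the searched value. Here is a minimal sketch of the idea (not ORC's actual implementation, which uses Murmur3 hashing and tuned sizing):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value: str):
        # Derive several bit positions per value (sha256 here for simplicity).
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        # False => the value is definitely absent, so a reader
        # can skip the whole stride without fetching its data.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))

# Conceptually, one filter per row-index stride (100,000 rows in the DDL above).
stride = BloomFilter()
for ref in ("3fc6a3162dd04ce4b3964d13067a0ba7", "aaaa", "bbbb"):
    stride.add(ref)

print(stride.might_contain("3fc6a3162dd04ce4b3964d13067a0ba7"))  # True
```

A stride whose filter returns False for the searched referral_id is never read, which is what turns a full scan into a few targeted range reads.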
Let’s run the query now:
SET hive.fetch.task.conversion = more;
SET hive.fetch.task.conversion.threshold = 10737418240;  -- 10 GB
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
Now it took only 40 seconds to find the rows:
2018-10-03 20:35:28 Compiling command SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
2018-10-03 20:35:32 Semantic Analysis Completed
2018-10-03 20:35:32 Executing command
2018-10-03 20:35:32 OrcInputFormat.java:generateSplitsInfo ORC pushdown predicate: leaf-0 = (EQUALS referral_id 3fc6a3162dd04ce4b3964d13067a0ba7)
2018-10-03 20:35:32 s3n.S3NativeFileSystem Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading
... 11 other Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading
2018-10-03 20:36:08 s3n.S3NativeFileSystem Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading
2018-10-03 20:36:11 CliDriver - Fetched: 26 row(s)
In the S3 access log there are now 204 GET.OBJECT operations, with 1.4 GB read in total and a total read time of 36.4 seconds.
My table has 63 columns; let’s see what performance I can get if I select only the three required columns id, name, and referral_id:
SELECT id, name, referral_id FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
Now it took only 9 seconds (!) to find the rows:
2018-10-04 13:42:16 Compiling command SELECT id, name, referral_id FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'
2018-10-04 13:42:20 Semantic Analysis Completed
2018-10-04 13:42:20 Executing command
2018-10-04 13:42:20 OrcInputFormat.java:generateSplitsInfo ORC pushdown predicate: leaf-0 = (EQUALS referral_id 3fc6a3162dd04ce4b3964d13067a0ba7)
2018-10-04 13:42:20 s3n.S3NativeFileSystem Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading
2018-10-04 13:42:28 s3n.S3NativeFileSystem Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading
2018-10-04 13:42:29 CliDriver - Fetched: 26 row(s)
In the S3 access log there are now only 55 GET.OBJECT operations, with just 90 MB (!) read in total and a total read time of 4 seconds.
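To put these numbers in perspective, here is the fraction of the 6.5 GB ORC file actually read in each case (my arithmetic on the figures above):

```python
FILE_GB = 6.5  # total ORC file size

# GB actually read from S3, from the access log figures above
cases = {
    "all columns, pushdown (204 GETs)": 1.4,
    "3 columns, pushdown (55 GETs)":    0.090,  # 90 MB
}

for name, read_gb in cases.items():
    pct = 100 * read_gb / FILE_GB
    print(f"{name}: {pct:.1f}% of the file")  # 21.5% and 1.4%
```

Column projection combined with predicate pushdown cuts S3 traffic to about 1.4% of the file, which is where the 9-second query time comes from.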
Predicate Pushdown Option
Note that when working with ORC files it is important to set the option hive.optimize.index.filter = true. Without this option, predicate pushdown is not performed and queries against ORC files run much slower:
| Query | Time | S3 Requests | S3 Read Size |
|---|---|---|---|
| Select 3 columns | 40 sec | 303 GET.OBJECT operations | 540 MB |
| Select all columns | 4 min 50 sec | 202 GET.OBJECT operations | 6.5 GB |
Summary
Here is the result of executing my Hive queries with the predicate:
| Query | S3 Read Size | Query Time |
|---|---|---|
| Compressed Text File | 5.3 GB | 6 min 33 sec |
| ORC File, No Predicate Pushdown, All Columns Selected | 6.5 GB | 4 min 50 sec |
| ORC File, Predicate Pushdown, All Columns Selected | 1.4 GB | 40 sec |
| ORC File, No Predicate Pushdown, 3 Columns Selected | 540 MB | 40 sec |
| ORC File, Predicate Pushdown, 3 Columns Selected | 90 MB | 9 sec |
You can see that it is very important to enable the predicate pushdown option and to select only the necessary columns to get the best performance from ORC files.
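The speedups relative to the compressed-text baseline can be computed from the summary table (my arithmetic; times converted to seconds):

```python
# Query times from the summary table, in seconds.
times = {
    "text, gz":                      6 * 60 + 33,  # 393 sec
    "ORC, no pushdown, all columns": 4 * 60 + 50,  # 290 sec
    "ORC, pushdown, all columns":    40,
    "ORC, no pushdown, 3 columns":   40,
    "ORC, pushdown, 3 columns":      9,
}

baseline = times["text, gz"]
for name, t in times.items():
    print(f"{name}: {baseline / t:.1f}x vs text baseline")
```

The best case (pushdown plus column projection) is roughly a 44x improvement over scanning the compressed text file, without changing the query itself.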
Amazon EMR Hive 2.1