
Simple Hive Queries with Predicates – Compressed Text vs ORC Files

Source data often arrive in Hadoop as compressed text files, and we frequently run SQL queries directly on top of them without any transformations.

Sometimes these queries are simple single-table searches that return just a few rows matching the specified predicates, and people often complain about their performance.

Compressed Text Files

Consider the following sample table:

CREATE TABLE clicks
(
   id STRING, 
   name STRING,
   ... 
   referral_id STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';

In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains a single file, data.txt.gz, that holds 27.3M rows and has a relatively small size of 5.3 GB.

I want to interactively query this table using SQL, for example, to find clicks by a referral ID, and expect to get just a dozen rows as the result:

SET hive.fetch.task.conversion = more;
SET hive.fetch.task.conversion.threshold = 10737418240; -- 10 GB

SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

Since there is only one .gz file, and gzip files are not splittable (only one mapper can process each file), I enabled direct fetch so that Hive reads the file itself without launching any MapReduce or Tez jobs.

Because this is a compressed text file, there is no way to push the predicate down into the storage layer: Hive has to read and decompress the entire file. The query took 6 minutes 33 seconds to execute.

I can see the following records in the Hive log:

2018-10-01 19:44:14 Compiling command
SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

2018-10-01 19:44:19 Semantic Analysis Completed
2018-10-01 19:44:19 Executing command

2018-10-01 19:44:20 s3n.S3NativeFileSystem
Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/data.txt.gz' for reading

2018-10-01 19:50:52 CliDriver - Fetched: 26 row(s)

Looking at the S3 access log I can see a single GET OBJECT request:

S3 Operation    | Key         | Bytes Sent    | Total Time (ms) | Turnaround (ms)
REST.GET.OBJECT | data.txt.gz | 5,371,182,086 | 390,601         | 140

Note: Total Time in the S3 access log shows how much time the client spent reading and processing the data file (in my case, decompressing the data and iterating over the table rows).

So you cannot use Total Time alone to evaluate S3 performance. For example, if I run the following code just to read the same data file, without decompressing it or iterating over rows:

#!/usr/bin/python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# Single-threaded transfer; the large multipart_threshold (10 GB) forces
# the whole 5.3 GB object to be fetched with a single GET request
config = TransferConfig(use_threads = False, multipart_threshold = 10000000000)
# Download to /dev/null -- we only want to measure the raw read time
s3.download_file("cloudsqale", "hive/dmtolpeko.db/clicks/data.txt.gz", "/dev/null", Config=config)

I get the following metrics in the S3 access log:

S3 Operation    | Key         | Bytes Sent    | Total Time (ms) | Turnaround (ms)
REST.GET.OBJECT | data.txt.gz | 5,371,182,086 | 53,790          | 112

So working with compressed text files in Hive brings significant overhead:

Operation                                                 | Time          | Throughput
Single Thread: Read, Decompress and Iterate (5.3 GB file) | 6 min 33 sec  | 13.8 MB/sec
Single Thread: Read Only (5.3 GB file)                    | 54 sec        | 101.3 MB/sec

ORC Files

Now let’s move the data to a table that stores it in ORC format with the following properties:

CREATE TABLE clicks
(
   id STRING, 
   name STRING,
   ... 
   referral_id STRING
)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/'
TBLPROPERTIES 
("orc.stripe.size"="268435456",
 "orc.row.index.stride"="100000",
 "orc.bloom.filter.columns"="referral_id");

I loaded all the data into a single ORC file; its size is about 6.5 GB now, compared with 5.3 GB for the compressed text file holding the same data.
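The orc.bloom.filter.columns property makes ORC write a Bloom filter for referral_id into the index of each row group: if the filter reports that a value is absent, the row group definitely does not contain it and can be skipped without being read. A toy Python sketch of the idea (not ORC's actual filter, which uses different hash functions and sizing):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter illustrating ORC row-group pruning.

    A value is hashed to several bit positions; add() sets those bits,
    and might_contain() checks them. If any bit is clear, the value was
    never added -- so a 'no' answer lets the reader skip the row group.
    """
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            h = hashlib.sha256(("%d:%s" % (i, value)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all(self.bits >> p & 1 for p in self._positions(value))
```

A Bloom filter can return false positives (an unnecessary read) but never false negatives (a missed row), which is what makes it safe for pruning.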

Let’s run the query now:

SET hive.fetch.task.conversion = more;
SET hive.fetch.task.conversion.threshold = 10737418240; -- 10 GB

SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true; 

SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

Now it took only 40 seconds to find the rows:

2018-10-03 20:35:28 Compiling command
SELECT * FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

2018-10-03 20:35:32 Semantic Analysis Completed
2018-10-03 20:35:32 Executing command

2018-10-03 20:35:32 OrcInputFormat.java:generateSplitsInfo
ORC pushdown predicate: leaf-0 = (EQUALS referral_id 3fc6a3162dd04ce4b3964d13067a0ba7)

2018-10-03 20:35:32 s3n.S3NativeFileSystem
Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading

... 11 more Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading records ...

2018-10-03 20:36:08 s3n.S3NativeFileSystem
Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading

2018-10-03T20:36:11 CliDriver - Fetched: 26 row(s)

In the S3 access log there are now 204 GET.OBJECT operations, with 1.4 GB read in total and a total read time of 36.4 seconds.
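Besides Bloom filters, ORC keeps min/max statistics for every row index stride (100,000 rows here, per orc.row.index.stride), and the pushed-down predicate is checked against those ranges before any column data is fetched. A minimal sketch of this pruning step (hypothetical stride statistics, not the actual ORC reader code):

```python
def strides_to_read(stride_stats, value):
    """Given per-stride (min, max) statistics for a column, keep only
    the strides whose value range could contain the predicate value.

    An equality predicate like referral_id = value can only match
    inside strides where min <= value <= max; all others are skipped
    without issuing any reads for their data.
    """
    return [i for i, (lo, hi) in enumerate(stride_stats)
            if lo <= value <= hi]
```

For a value distributed roughly randomly across the table (like a hash-style referral ID), min/max pruning alone is weak, and the Bloom filter does most of the work; together they explain why only 1.4 GB of the 6.5 GB file was read.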

My table has 63 columns; let's see what performance I can get if I select only the 3 columns I need: id, name and referral_id:

SELECT id, name, referral_id FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

Now it took only 9 seconds (!) to find the rows:

2018-10-04 13:42:16 Compiling command
SELECT id, name, referral_id FROM clicks WHERE referral_id = '3fc6a3162dd04ce4b3964d13067a0ba7'

2018-10-04 13:42:20 Semantic Analysis Completed
2018-10-04 13:42:20 Executing command

2018-10-04 13:42:20 OrcInputFormat.java:generateSplitsInfo
ORC pushdown predicate: leaf-0 = (EQUALS referral_id 3fc6a3162dd04ce4b3964d13067a0ba7)

2018-10-04 13:42:20 s3n.S3NativeFileSystem
Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading

2018-10-04T13:42:28 s3n.S3NativeFileSystem
Opening 's3://cloudsqale/hive/dmtolpeko.db/clicks/000002_0' for reading

2018-10-04T13:42:29 CliDriver - Fetched: 26 row(s)

In the S3 access log there are now only 55 GET.OBJECT operations, with just 90 MB (!) read in total and a total read time of 4 seconds.
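The drop from 1.4 GB to 90 MB comes from ORC's columnar layout: each column's data is stored contiguously within a stripe, so the reader fetches only the byte ranges of the columns the query selects. A toy comparison of the two layouts (made-up data, not ORC's actual encoding):

```python
def bytes_scanned_row_layout(rows, selected):
    """Row-oriented storage (like a text file): every field of every
    row sits inline, so the scan reads all of them regardless of
    which columns the query selects."""
    return sum(len(value) for row in rows for value in row.values())

def bytes_scanned_columnar(columns, selected):
    """Columnar storage: each column is a separate contiguous run of
    bytes, so only the selected columns are ever fetched."""
    return sum(len(value) for name in selected for value in columns[name])
```

With 63 columns and only 3 selected, the columnar reader skips roughly 95% of the data in each matching stripe, which matches the order-of-magnitude drop in bytes read.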

Predicate Pushdown Option

Note that when working with ORC files it is important to set the option hive.optimize.index.filter = true. Without it, predicate pushdown is not performed, and queries against ORC files run much slower:

Query              | Time          | S3 Requests                  | Data Read
Select 3 Columns   | 40 seconds    | 303 GET.OBJECT operations    | 540 MB
Select All Columns | 4 min 50 sec  | 202 GET.OBJECT operations    | 6.5 GB

Summary

Here is the result of executing my Hive queries with the predicate:

File Format and Query                                 | S3 Read Size | Query Time
Compressed Text File                                  | 5.3 GB       | 6 min 33 sec
ORC File, No Predicate Pushdown, All Columns Selected | 6.5 GB       | 4 min 50 sec
ORC File, Predicate Pushdown, All Columns Selected    | 1.4 GB       | 40 sec
ORC File, No Predicate Pushdown, 3 Columns Selected   | 540 MB       | 40 sec
ORC File, Predicate Pushdown, 3 Columns Selected      | 90 MB        | 9 sec

As you can see, enabling the predicate pushdown option and selecting only the necessary columns are both essential to getting the best performance from ORC files.

Amazon EMR Hive 2.1