
    Simple Hive Queries with Predicates – Compressed Text vs ORC Files

    Source data usually arrive in Hadoop as compressed text files, and we often run SQL queries directly on top of them without any transformation.

    Sometimes these are simple single-table search queries that return only a few rows matching the specified predicates, and people often complain about their performance.

    Compressed Text Files

    Consider the following sample table:

    CREATE TABLE clicks
    (
       id STRING, 
       name STRING,
       ... 
       referral_id STRING
    )
    STORED AS TEXTFILE
    LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
    

    In my case, s3://cloudsqale/hive/dmtolpeko.db/clicks contains a single file, data.txt.gz, that has 27.3M rows and a relatively small size of 5.3 GB.
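
    To make the kind of query discussed here concrete, a minimal sketch of a single-table search with one predicate might look like the following. The column referral_id comes from the table definition above; the literal value is hypothetical and only stands in for whatever key is being searched.

```sql
-- Hypothetical search query against the clicks table defined above.
-- The literal 'some-referral-id' is made up for illustration.
SELECT *
FROM clicks
WHERE referral_id = 'some-referral-id';
```

    Because the table is stored as a single gzip-compressed text file, and gzip files are not splittable in Hadoop, a query like this has to decompress and scan the whole file to find the few matching rows.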