• Amazon,  AWS,  EMR,  Hive,  ORC,  Tez

    Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR

    Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files but having small stripes.

    Note. All experiments below were executed on Amazon Hive 2.1.1. This article does not apply to Qubole running on Amazon AWS. Qubole has a different algorithm to define the number of map tasks for ORC files.

  • Hive,  I/O,  ORC,  S3,  Storage

    Simple Hive Queries with Predicates – Compressed Text vs ORC Files

    Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

    Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

    Compressed Text Files

    Consider the following sample table:

    CREATE TABLE clicks
    (
       id STRING, 
       name STRING,
       ... 
       referral_id STRING
    )
    STORED AS TEXTFILE
    LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
    

    In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.