• Hive,  I/O,  ORC,  S3,  Storage

    Simple Hive Queries with Predicates – Compressed Text vs ORC Files

    Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

    Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

    Compressed Text Files

    Consider the following sample table:

    CREATE TABLE clicks
    (
       id STRING, 
       name STRING,
       ... 
       referral_id STRING
    )
    STORED AS TEXTFILE
    LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
    

    In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.

  • Hive,  Hue,  Qubole

    Operation Timed Out for Hive Query in Hue and Qubole

    Sometimes when you execute a very simple query in Hive you can get “The operation timed out. Do you want to retry?” error in Hue or “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.

    TL;DR: Set this before your query to resolve this:

    set hive.fetch.task.conversion = none;

    Consider the following query that times out in my case:

    SELECT * 
    FROM clicks
    WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND 
      application = 'CLOUD_APP' AND env = 'PRD' AND id = 5
    LIMIT 100;
    

    Instead of executing a MapReduce or Tez job Hive just decides that it can read the data directly from the storage, it takes too long time so a time out happens.