• Hive,  ORC,  Storage

    ORC File Format Internals – Creating Large Stripes in Hive Tables

    Usually the source data arrives as compressed text files, and the first step in an ETL process is to convert them to a columnar format for more effective query execution by users.

    Let’s consider a simple example when we have a single 120 MB source file in .gz format:

    $ aws s3 ls s3://cloudsqale/hive/events.db/events_raw/
    2018-12-16 18:49:45  120574494 data.gz
    

    and want to convert it into a Hive table with the ORC file format having 256 MB stripe size. Will 120 MB of .gz data be loaded into a single 256 MB stripe? Not so easy.

  • Amazon,  AWS,  EMR,  Hive,  I/O,  S3

    S3 Writes When Inserting Data into a Hive Table in Amazon EMR

    Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.

    Just to focus on S3 writes I am going to use a very simple SQL INSERT statement just moving data from one table into another without any transformations as follows:

    INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
    SELECT
      record_id,
      event_timestamp,
      event_name,
      app_name,
      country,
      city,
      payload
    FROM events_raw;
    
  • Amazon,  AWS,  EMR,  Hive,  ORC,  Tez

    Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR

    Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files but having small stripes.

    Note. All experiments below were executed on Amazon Hive 2.1.1. This article does not apply to Qubole running on Amazon AWS. Qubole has a different algorithm to define the number of map tasks for ORC files.

  • Hive,  I/O,  Tez

    Tez Internals #1 – Number of Map Tasks

    Let’s see how Apache Tez defines the number of map tasks when you execute a SQL query in Hive running on the Tez engine.

    Consider the following sample SQL query on a partitioned table:

    select count(*) from events where event_dt = '2018-10-20' and event_name = 'Started';
    

    The events table is partitioned by event_dt and event_name columns, and it is quite big – it has 209,146 partitions, while the query requests data from a single partition only:

    $ hive -e "show partitions events"
    
    event_dt=2017-09-22/event_name=ClientMetrics
    event_dt=2017-09-23/event_name=ClientMetrics
    ...
    event_dt=2018-10-20/event_name=Location
    ...
    event_dt=2018-10-20/event_name=Started
    
    Time taken: 26.457 seconds, Fetched: 209146 row(s)
    
  • Hive,  I/O,  ORC,  S3,  Storage

    Simple Hive Queries with Predicates – Compressed Text vs ORC Files

    Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

    Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

    Compressed Text Files

    Consider the following sample table:

    CREATE TABLE clicks
    (
       id STRING, 
       name STRING,
       ... 
       referral_id STRING
    )
    STORED AS TEXTFILE
    LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
    

    In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.

  • Hive,  Hue,  Qubole

    Operation Timed Out for Hive Query in Hue and Qubole

    Sometimes when you execute a very simple query in Hive you can get “The operation timed out. Do you want to retry?” error in Hue or “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.

    TL;DR: Set this before your query to resolve this:

    set hive.fetch.task.conversion = none;

    Consider the following query that times out in my case:

    SELECT * 
    FROM clicks
    WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND 
      application = 'CLOUD_APP' AND env = 'PRD' AND id = 5
    LIMIT 100;
    

    Instead of executing a MapReduce or Tez job Hive just decides that it can read the data directly from the storage, it takes too long time so a time out happens.