• Hive,  Hue,  Qubole

    Operation Timed Out for Hive Query in Hue and Qubole

    Sometimes when you execute a very simple query in Hive you can get “The operation timed out. Do you want to retry?” error in Hue or “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.

    TL;DR: Set this before your query to resolve this:

    set hive.fetch.task.conversion = none;

    Consider the following query that times out in my case:

    SELECT * 
    FROM clicks
    WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND 
      application = 'CLOUD_APP' AND env = 'PRD' AND id = 5
    LIMIT 100;
    

    Instead of executing a MapReduce or Tez job Hive just decides that it can read the data directly from the storage, it takes too long time so a time out happens.

  • Amazon,  AWS,  I/O,  Logs,  S3,  Storage

    Collecting S3 Access Logs

    Amazon allows you to enable S3 access logging that you can use to monitor S3 performance: request rate, I/O workload, user and compute node level statistics, service delays and outages, and much more.

    S3 log files are quite small, uncompressed text files that in case of intensive S3 usage can be generated almost every second:

    ...
    2018-09-20 16:20:27     323567 2018-09-20-16-20-26-DE17FAE504462084
    2018-09-20 16:20:28     598192 2018-09-20-16-20-27-5F17C98DFA22DA31
    2018-09-20 16:20:29     618862 2018-09-20-16-20-28-4660E2CBCB0FB2C5
    2018-09-20 16:20:32     381675 2018-09-20-16-20-31-16549B7BABDA06AE
    2018-09-20 16:20:33     405131 2018-09-20-16-20-32-14AB46312C254397
    2018-09-20 16:20:34     587042 2018-09-20-16-20-33-385E799AFCEBAEE3
    2018-09-20 16:20:35     358275 2018-09-20-16-20-34-FA52E601A410E529
    2018-09-20 16:20:36     604080 2018-09-20-16-20-35-C02066EDF9026EF9
    ...
    

    So you can have 35K+ files generated per day (and there is no a sub-directory for each day), and if you are going to analyze S3 statistics for a long period of time (weeks, months), the performance of your Hive or Presto queries can be very low.

    Additionally there is often a lifecycle rule defined to keep logs only for 1-2 days.