Sometimes when you execute a very simple query in Hive, you can get the “The operation timed out. Do you want to retry?” error in Hue or the “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.
TL;DR: set this option before your query to resolve the problem:
set hive.fetch.task.conversion = none;
Consider the following query that times out in my case:
SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;
Instead of executing a MapReduce or Tez job, Hive decides that it can read the data directly from storage; this takes too long, so a timeout occurs.
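You can see this decision in the query plan. Here is a sketch of what EXPLAIN shows when the fetch task conversion kicks in (the output is abbreviated, and its exact shape varies by Hive version):

EXPLAIN SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 100

The plan contains only a Fetch Operator stage and no MapReduce or Tez stage, which means the Hive driver itself will read the files.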
In my case the problem happens for a partitioned table and each partition has statistics.
Note that the clicks table has hourly partitions while the query fetches a full day of data:
describe formatted clicks partition (event_dt='2018-09-17', event_hour='09');
...
Partition Parameters:
    EXTERNAL                TRUE
    numFiles                3
    serialization.format    1
    totalSize               206273994
At the same time, there are no statistics at the table level:
describe formatted clicks;
...
Table Type:             EXTERNAL_TABLE
Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1531380118
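If you want Hive to gather statistics explicitly, you could run ANALYZE; here is a minimal sketch for the hourly partitions of one day. Note that for a partitioned table this gathers per-partition statistics, and whether it also fixes the fetch task size estimate depends on the Hive version, so treat it as a sketch rather than a guaranteed fix:

ANALYZE TABLE clicks PARTITION (event_dt='2018-09-17', event_hour) COMPUTE STATISTICS;

This computes basic statistics such as numFiles, numRows and totalSize for every event_hour partition of the given day.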
I run Hive 2.1 on Amazon EMR and Qubole, and the settings are:
set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1073741824;
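To verify what your session actually uses, run set with just the property name; the Hive CLI prints the current value:

set hive.fetch.task.conversion;
hive.fetch.task.conversion=more

set hive.fetch.task.conversion.threshold;
hive.fetch.task.conversion.threshold=1073741824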
Although the total size of the partitions is much larger than the 1 GB specified by hive.fetch.task.conversion.threshold, Hive cannot correctly estimate the size of the table and still reads the data directly.
If you look at the Hive client (or HiveServer2) logs, you can clearly see that instead of launching a MapReduce or Tez job the Hive driver reads the data itself:
2018-09-26T09:34:49,775 Opening 's3n://cloudsqale/hive/dmtolpeko.db/clicks/2018-09-17/00/part-r-00000.gz' for reading
... (read of 202 other files here) ...
2018-09-26T09:54:26,903 Opening 's3n://cloudsqale/hive/dmtolpeko.db/clicks/2018-09-18/20/part-r-00002.gz' for reading
2018-09-26T09:54:29,399 S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection.
2018-09-26T09:54:29,444 CliDriver - Fetched: 100 row(s)
Note that Hive still performed partition pruning and did not read the entire table. Also notice that Hive stopped reading data files once it found the first 100 rows (there is LIMIT 100 in the SQL query).
You can see that Hive took almost 20 minutes to execute this query from the Hive CLI, and it definitely times out when run in Hue or Qubole, so you have to force the execution of a MapReduce or Tez job by setting:
set hive.fetch.task.conversion = none;
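So the session-scoped fix is just this setting followed by the original query:

set hive.fetch.task.conversion=none;

SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;

With none, Hive always compiles a job for the query. A softer alternative may be set hive.fetch.task.conversion=minimal;, which converts only SELECT * queries that filter on partition columns alone; since this query also filters on event_name, application, env and id, minimal should force a job too, although I did not test it for this case.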