Sometimes when you execute a very simple query in Hive, you can get the “The operation timed out. Do you want to retry?” error in Hue or the “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.
TL;DR: set this option before your query to resolve the problem:
set hive.fetch.task.conversion = none;
Consider the following query that times out in my case:
SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;
Instead of executing a MapReduce or Tez job, Hive decides that it can read the data directly from storage; this takes too long, so a timeout occurs.
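You can see this decision in the query plan. Here is a sketch of what EXPLAIN shows when the fetch task conversion kicks in (the output is abbreviated, and its exact shape varies by Hive version):

EXPLAIN SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 100

The plan contains only a Fetch Operator stage and no MapReduce or Tez stage, which means the Hive driver itself will read the files.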
In my case the problem happens for a partitioned table and each partition has statistics.
Note that the clicks table has hourly partitions while the query fetches a full day of data:
describe formatted clicks partition (event_dt='2018-09-17', event_hour='09');
...
Partition Parameters:
    EXTERNAL                TRUE
    numFiles                3
    serialization.format    1
    totalSize               206273994
At the same time, there are no statistics at the table level:
describe formatted clicks;
...
Table Type:             EXTERNAL_TABLE
Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1531380118
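If you want Hive to gather statistics explicitly, you could run ANALYZE; here is a minimal sketch for the hourly partitions of one day. Note that for a partitioned table this gathers per-partition statistics, and whether it also fixes the fetch task size estimate depends on the Hive version, so treat it as a sketch rather than a guaranteed fix:

ANALYZE TABLE clicks PARTITION (event_dt='2018-09-17', event_hour) COMPUTE STATISTICS;

This computes basic statistics such as numFiles, numRows and totalSize for every event_hour partition of the given day.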
I run Hive 2.1 on Amazon EMR and Qubole, and the settings are:
set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1073741824;
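To verify what your session actually uses, run set with just the property name; the Hive CLI prints the current value:

set hive.fetch.task.conversion;
hive.fetch.task.conversion=more

set hive.fetch.task.conversion.threshold;
hive.fetch.task.conversion.threshold=1073741824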
Although the total size of the partitions is much larger than the 1 GB specified by hive.fetch.task.conversion.threshold, Hive cannot correctly estimate the size of the table and still reads the data directly.
If you look at the Hive client (or HiveServer2) logs, you can clearly see that instead of launching a MapReduce or Tez job the Hive driver reads the data itself:
2018-09-26T09:34:49,775 Opening 's3n://cloudsqale/hive/dmtolpeko.db/clicks/2018-09-17/00/part-r-00000.gz' for reading
... (read of 202 other files here) ...
2018-09-26T09:54:26,903 Opening 's3n://cloudsqale/hive/dmtolpeko.db/clicks/2018-09-18/20/part-r-00002.gz' for reading
2018-09-26T09:54:29,399 S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection.
2018-09-26T09:54:29,444 CliDriver - Fetched: 100 row(s)
Note that Hive still performed partition pruning and did not read the entire table. Also notice that Hive stopped reading data files once it found the first 100 rows (there is LIMIT 100 in the SQL query).
You can see that Hive took almost 20 minutes to execute this query from the Hive CLI, and it definitely times out when run in Hue or Qubole, so you have to force the execution of a MapReduce or Tez job by setting:
set hive.fetch.task.conversion = none;
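So the session-scoped fix is just this setting followed by the original query:

set hive.fetch.task.conversion=none;

SELECT * FROM clicks WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND application = 'CLOUD_APP' AND env = 'PRD' AND id = 5 LIMIT 100;

With none, Hive always compiles a job for the query. A softer alternative may be set hive.fetch.task.conversion=minimal;, which converts only SELECT * queries that filter on partition columns alone; since this query also filters on event_name, application, env and id, minimal should force a job too, although I did not test it for this case.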