Presto vs Hive – SLA Risks for Long Running ETL – Failures and Retries Due to Node Loss

Presto is an extremely powerful distributed SQL query engine, so at some point you may consider using it to replace SQL-based ETL processes that you currently run on Apache Hive.

Although it is completely possible, you should be aware of some limitations that may affect your SLAs.

Presto is highly optimized for low latency, and to achieve this Presto builds very efficient in-memory data pipelines between tasks and stages instead of “materializing” the intermediate results like Hive does.

Here comes a problem. Although Presto queries can complete much faster in general compared with Hive, if a Presto worker dies or experiences a network partition in the middle of a query then the query fails. It cannot survive a loss of a compute node and the full query execution must be restarted from scratch.

Compute node failures are normal in a cloud environment, and some ETL processes can run for many hours, so their full restart can be prohibitive and may miss SLA requirements.

At the same time, although Hive may require more time to complete the job, it is much more resilient to node failures. Any individual task attempt can be restarted on a different node, there is a speculative execution support, and you have many options to manage the task duration and possible restart time by defining the appropriate data chunks (splits) for tasks in Hive. So in many cases the execution time and its variance can be more predictable in Hive.

In case of a node failure Presto queries fail with INTERNAL ERROR, and a few possible errors are:

Connection Timeout

com.facebook.presto.operator.PageTransportTimeoutException: 
  Encountered too many errors talking to a worker node. The node may have crashed 
  or be under too much load. This is probably a transient issue, so please retry your query
  in a few minutes. 
  - 54 failures, failure duration 300.18s, total failed request time 309.84s)
Caused by: java.io.UncheckedIOException: java.net.SocketTimeoutException: Connect Timeout

Spot Node Loss

Command execution failed with exception:
com.facebook.presto.SpotLossException

Conclusion

It definitely makes sense to use Presto for ETL, but in case of critical long running ETL with strict SLAs:

Divide workflows into smaller chunks (run a set of small SQL statements instead of a big one).

Do not use the Spot nodes that can be reclaimed on very short notice (2 minutes in the Amazon cloud).

The Presto community is working on providing fault tolerance in Presto, so hopefully partial query restarts will be available soon.