Flink and S3 Entropy Injection for Checkpoints – Large-Scale Data Engineering in Cloud

When you store checkpoints in S3, it can easily become a bottleneck, especially for a Flink application with a lot of subtasks. To overcome this problem, FLINK-9061 introduced entropy injection into the checkpoint path.

However, the Flink documentation (at least up to Flink 1.13) provides a misleading example that actually destroys the value of checkpoint entropy.

The Flink documentation offers the following example for S3 entropy:
s3://my-bucket/checkpoints/_entropy_/dashboard-job

That means that every checkpoint key will still start with the constant checkpoints/ prefix, followed by a random character sequence put in place of _entropy_.

We did not pay much attention to this in one of our applications until we started to notice the application “freezing” and a growing number of HTTP 503 Amazon S3 Slow Down errors.

For scalability of large-scale S3 operations it is still important (at the time of writing, no matter what Amazon says) to have keys that start with random values, i.e., the checkpoint entropy must start immediately after the S3 bucket name, for example:
s3://my-bucket/_entropy_/...

After making this change, we were able to reduce the number of 503 Slow Down errors by 20-30x. Interestingly, the original FLINK-9061 ticket provides a correct example of checkpoint entropy from Netflix:

s3://bucket/_ENTROPY_KEY_/flink/checkpoints
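
To make the difference between the two path templates concrete, here is a minimal Python sketch. It is only an illustration and not Flink's actual implementation: the helper names (inject_entropy, first_key_segment, distinct_leading_segments), the alphabet, and the entropy length of 4 are all assumptions chosen for the example. It substitutes random characters for the entropy marker and counts how many distinct leading key segments each template produces.

```python
import random
import string
from urllib.parse import urlparse

# Hypothetical alphabet for the random characters (an assumption for this sketch).
ALPHABET = string.ascii_lowercase + string.digits


def inject_entropy(template: str, entropy_key: str, length: int = 4) -> str:
    """Replace the entropy marker with `length` random alphanumeric characters."""
    entropy = "".join(random.choice(ALPHABET) for _ in range(length))
    return template.replace(entropy_key, entropy)


def first_key_segment(path: str) -> str:
    """Return the first segment of the object key, i.e. the part right after the bucket."""
    return urlparse(path).path.lstrip("/").split("/", 1)[0]


def distinct_leading_segments(template: str, entropy_key: str, samples: int = 1000) -> int:
    """Count how many different leading key segments `samples` random paths produce."""
    return len(
        {first_key_segment(inject_entropy(template, entropy_key)) for _ in range(samples)}
    )


if __name__ == "__main__":
    # Template from the Flink documentation: the key always starts with "checkpoints".
    docs = distinct_leading_segments(
        "s3://my-bucket/checkpoints/_entropy_/dashboard-job", "_entropy_"
    )
    # Template from FLINK-9061: the entropy comes right after the bucket name.
    netflix = distinct_leading_segments(
        "s3://bucket/_ENTROPY_KEY_/flink/checkpoints", "_ENTROPY_KEY_"
    )
    print("docs template, distinct leading segments:", docs)
    print("FLINK-9061 template, distinct leading segments:", netflix)
```

With the documentation template every generated key shares the same leading segment, while the FLINK-9061 template spreads the keys across many different leading segments, which is the property the change described above relies on.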