June 2020 – Large-Scale Data Engineering in Cloud

Hadoop, YARN

Hadoop YARN – Monitoring Resource Consumption by Running Applications in Multi-Cluster Environments

June 25, 2020

In the cloud it is typical to run multiple compute clusters, so checking each cluster's Web UI for the current resource consumption by applications is tedious, especially when the YARN clusters are managed by different Hadoop distributions (Amazon EMR, Cloudera, Qubole, etc.).

Let’s see how you can automate this process and find out how many applications are running and which resources they are consuming (containers, memory and CPU).
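One way to automate the check is to poll each cluster's ResourceManager REST API instead of its Web UI. The sketch below is a minimal illustration, not the post's own implementation: it assumes the standard `/ws/v1/cluster/apps` endpoint with its `runningContainers`, `allocatedMB` and `allocatedVCores` fields, and the cluster names and host URLs are hypothetical placeholders.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager addresses; replace with your own clusters.
CLUSTERS = {
    "emr-etl": "http://emr-etl-master:8088",
    "cdh-adhoc": "http://cdh-adhoc-master:8088",
}


def fetch_running_apps(rm_url, timeout=10):
    """Return the RUNNING applications reported by one ResourceManager."""
    url = rm_url.rstrip("/") + "/ws/v1/cluster/apps?states=RUNNING"
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # "apps" is null when nothing is running, so fall back to an empty list.
    return (payload.get("apps") or {}).get("app") or []


def summarize(apps):
    """Total containers, memory (MB) and vcores across the given applications."""
    return {
        "applications": len(apps),
        "containers": sum(a.get("runningContainers", 0) for a in apps),
        "memory_mb": sum(a.get("allocatedMB", 0) for a in apps),
        "vcores": sum(a.get("allocatedVCores", 0) for a in apps),
    }


def report(clusters):
    """Print one summary line per cluster, skipping clusters that cannot be reached."""
    for name, rm_url in clusters.items():
        try:
            print(name, summarize(fetch_running_apps(rm_url)))
        except OSError as err:  # connection refused, DNS failure, timeout
            print(name, "unavailable:", err)
```

Calling `report(CLUSTERS)` from a scheduled job prints one line per cluster. Because the REST API is part of YARN itself, the same code should work regardless of which distribution manages the cluster, provided the ResourceManager port is reachable.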

dmtolpeko
I/O, Parquet, Storage

How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

June 18, 2020

It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

When JSON data has an arbitrary schema, i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

How is it stored? What read performance can you expect? Will json_map["key"] read only the data for that key or the entire JSON?
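As a rough mental model, the Parquet format defines a map as a repeated group of key/value pairs, so all keys of all rows end up in one column and all values in another. The Python sketch below imitates that layout with plain lists. It is a simplification under stated assumptions: real Parquet uses repetition and definition levels rather than the per-row `counts` list used here, and the function names are made up for illustration.

```python
def shred_map_column(rows):
    """Flatten per-row dicts into a key column, a value column and per-row entry counts."""
    keys, values, counts = [], [], []
    for row in rows:
        counts.append(len(row))
        for k, v in row.items():
            keys.append(k)
            values.append(v)
    return keys, values, counts


def lookup(keys, values, counts, wanted):
    """Emulate json_map[wanted] by walking every stored entry of every row."""
    result, pos = [], 0
    for n in counts:
        found = None
        for i in range(pos, pos + n):
            if keys[i] == wanted:
                found = values[i]
        result.append(found)
        pos += n
    return result


rows = [
    {"user": "a", "page": "/home"},
    {"user": "b", "device": "ios", "page": "/cart"},
    {"device": "web"},
]
keys, values, counts = shred_map_column(rows)
print(keys)    # ['user', 'page', 'user', 'device', 'page', 'device']
print(lookup(keys, values, counts, "page"))  # ['/home', '/cart', None]
```

In this model a lookup for a single key still has to scan the full key and value columns, which suggests why frequently queried keys are often better extracted into dedicated columns.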

dmtolpeko
Flink, I/O, Parquet, S3

Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

June 9, 2020

It is quite common to have a streaming Flink application that reads incoming data and writes it to Parquet files with low latency (a couple of minutes), so that analysts can run both near-real-time and historical ad-hoc analysis, mostly using SQL queries.

But let’s review write patterns and problems that can appear for such applications at scale.

dmtolpeko
AWS, I/O, S3, Storage

S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

June 4, 2020

Amazon S3 is a highly scalable distributed system that can handle extremely large volumes of data, adapt to an increasing workload and provide quite good performance as file storage.

But sometimes you have to tweak it to run faster, which can be especially important for latency-sensitive applications.
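The idea named in the title, a short request timeout combined with aggressive retries, can be sketched generically as follows. This is an illustration under assumptions, not the post's code: `call_with_retries` is a hypothetical helper, the default values are arbitrary, and `operation` is assumed to be any callable that accepts a timeout in seconds and raises `TimeoutError` when the request is too slow.

```python
import time


def call_with_retries(operation, timeout_s=0.5, max_attempts=5, backoff_s=0.05):
    """Call operation(timeout_s), retrying quickly whenever an attempt times out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout_s)
        except TimeoutError as err:
            last_error = err
            if attempt < max_attempts:
                # Short, linearly growing pause before trying again.
                time.sleep(backoff_s * attempt)
    # Every attempt timed out: surface the last timeout to the caller.
    raise last_error
```

The design choice is to give up on a slow request early and issue a fresh one, rather than wait for the slow request to finish. This only pays off when the operation is safe to repeat, such as writing the same object under the same key.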

dmtolpeko