I already wrote about recovering ghost nodes in Amazon EMR clusters, but in this article I am going to extend this information to EMR clusters configured for auto-scaling.
In a Hadoop cluster, besides Active nodes, you may also have Unhealthy, Lost and Decommissioned nodes. Unhealthy nodes are running but are excluded from task scheduling because, for example, they do not have enough free disk space. Lost nodes are nodes that are no longer reachable. Decommissioned nodes are nodes that were successfully terminated and left the cluster.
But all these nodes are known to the Hadoop YARN cluster and you can see their details such as IP addresses, last health updates and so on.
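For example, besides the ResourceManager UI you can fetch these details through the ResourceManager REST API. Here is a small Python sketch (replace the placeholder with your master node address):

import requests

# The YARN ResourceManager exposes node details over its REST API.
# Replace the placeholder with your cluster's master node address.
RM_URL = "http://<RM_IP_Address>:8088"

nodes = requests.get(f"{RM_URL}/ws/v1/cluster/nodes").json()["nodes"]["node"]
for n in nodes:
    # state is RUNNING, UNHEALTHY, LOST, DECOMMISSIONED and so on
    print(n["nodeHostName"], n["state"], n.get("healthReport", ""))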
At the same time there can also be Ghost nodes, i.e. nodes that are still run by the Amazon EMR services but that Hadoop itself does not know anything about. Let’s see how you can find them.
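One way to find them is to compare the list of running instances that Amazon EMR reports with the node list that YARN knows about. A sketch (the cluster ID and ResourceManager address are placeholders):

import boto3
import requests

# Compare the instances that Amazon EMR reports as RUNNING with the nodes
# that YARN knows about; anything EMR runs but YARN has never seen is a
# ghost node. CLUSTER_ID and RM_URL are placeholders to fill in.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
RM_URL = "http://<RM_IP_Address>:8088"

emr = boto3.client("emr")
emr_ips = set()
for page in emr.get_paginator("list_instances").paginate(
        ClusterId=CLUSTER_ID,
        InstanceGroupTypes=["CORE", "TASK"],
        InstanceStates=["RUNNING"]):
    for inst in page["Instances"]:
        emr_ips.add(inst["PrivateIpAddress"])

nodes = requests.get(f"{RM_URL}/ws/v1/cluster/nodes").json()["nodes"]["node"]
yarn_ips = set()
for n in nodes:
    host = n["nodeHostName"].split(".")[0]  # e.g. "ip-10-0-0-1"
    yarn_ips.add(host[3:].replace("-", ".") if host.startswith("ip-") else host)

print("Ghost nodes:", sorted(emr_ips - yarn_ips))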
Usually Hadoop is able to automatically recover cluster nodes from the Unhealthy state by cleaning log and temporary directories. But sometimes nodes stay unhealthy for a long time, and manual intervention is necessary to bring them back.
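When a node is unhealthy because of full log or temporary directories, freeing disk space usually brings it back once utilization falls below the NodeManager threshold. A cautious sketch (the path and the age threshold are assumptions; check yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs on your cluster before deleting anything):

import os
import time

# Free disk space by removing old YARN container logs. The directory and
# the 3-day threshold are assumptions; verify them for your cluster.
LOG_DIR = "/var/log/hadoop-yarn/containers"
MAX_AGE = 3 * 24 * 3600

now = time.time()
for root, _dirs, files in os.walk(LOG_DIR):
    for name in files:
        path = os.path.join(root, name)
        if now - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)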
Amazon EMR allows you to define scale-out and scale-in rules to automatically add and remove instances based on the metrics you specify.
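With the AWS SDK such rules can be attached to an instance group programmatically. Here is a sketch of a single scale-out rule driven by the YARNMemoryAvailablePercentage metric (the cluster and instance group IDs, thresholds and capacity limits are placeholders):

import boto3

# Attach an automatic scaling policy with one scale-out rule to a task
# instance group. All IDs and numbers below are placeholder assumptions.
emr = boto3.client("emr")
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupId="ig-XXXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "scale-out-on-low-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)

A matching scale-in rule, for example one triggered when YARNMemoryAvailablePercentage stays high, would normally be defined alongside it.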
In this article I am going to explore the instance controller logs, which can be very useful in monitoring auto-scaling. The logs are located in the /emr/instance-controller/log/ directory on the EMR master node.
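As a quick way to look through these logs, here is a sketch that scans them for the per-instance status summary lines (the file name pattern and the InstanceJointStatusMap keyword are assumptions; adjust them to what you actually see in your logs):

import glob
import gzip

# Scan the instance controller logs on the master node for status summary
# lines. The file pattern and the keyword below are assumptions.
for path in sorted(glob.glob("/emr/instance-controller/log/instance-controller.log*")):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as f:
        for line in f:
            if "InstanceJointStatusMap" in line:  # per-instance status summary
                print(path, line.rstrip())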
When you run many Hadoop clusters it is useful to automatically collect metrics from all clusters in a single place (a Hive table, for example).
This allows you to perform advanced and custom analysis of your cluster workloads without being limited to the features provided by the Hadoop administration UI tools, which often offer only a per-cluster view, making it hard to see the whole picture of your data platform.
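A collector for this can be as small as the following sketch, which polls the YARN cluster metrics endpoint of every master node and appends the values to a CSV file (the cluster names and addresses are made up):

import csv
import time
import requests

# Poll the YARN cluster metrics endpoint of each master node and append
# the results to a CSV file. Cluster names and addresses are assumptions.
CLUSTERS = {"etl": "http://10.0.1.10:8088", "adhoc": "http://10.0.2.10:8088"}

with open("/tmp/cluster_metrics.csv", "a", newline="") as f:
    w = csv.writer(f)
    for name, rm in CLUSTERS.items():
        m = requests.get(f"{rm}/ws/v1/cluster/metrics").json()["clusterMetrics"]
        w.writerow([int(time.time()), name,
                    m["activeNodes"], m["unhealthyNodes"], m["lostNodes"],
                    m["appsRunning"], m["appsPending"],
                    m["allocatedMB"], m["availableMB"]])

Placing such a file under the location of an external Hive table makes the collected metrics immediately queryable with SQL.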
After creating an Amazon EMR cluster with Spark support and running a Spark application, you may notice that the Spark job creates too many tasks to process even a very small data set.
For example, I have a small table country_iso_codes having 249 rows, stored in a comma-delimited text file with a length of 10,657 bytes.
When running the following application on an Amazon EMR 5.7 cluster with Spark 2.1.1 and the default settings, I can see a large number of partitions generated:
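As a minimal PySpark sketch of such an application (not the original listing; everything except the table name country_iso_codes from the example above is an assumption):

from pyspark.sql import SparkSession

# Read the small Hive table and report how many partitions, and hence
# Spark tasks, are created for it.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("country_iso_codes")
print("Rows:", df.count())
print("Partitions:", df.rdd.getNumPartitions())

Even for a file of about 10 KB the reported partition count can be surprisingly high, which is what the rest of the article investigates.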
When you run a job in Hadoop you may notice the following error:
Application with id 'application_1545962730597_2614' doesn't exist in RM.
And later, looking at the YARN Resource Manager UI at http://<RM_IP_Address>:8088/cluster/apps, you can see low application ID numbers:
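The application ID itself hints at what happened. Its first numeric part is the ResourceManager start time in epoch milliseconds, so you can decode it with a few lines of Python (a small sketch using the ID from the error above):

from datetime import datetime, timezone

# YARN application IDs have the form application_<cluster timestamp>_<sequence>,
# where the timestamp is the ResourceManager start time in epoch milliseconds.
# A recent timestamp combined with low sequence numbers means the ResourceManager
# was restarted and the application counter started again from 1.
app_id = "application_1545962730597_2614"
_, ts_ms, seq = app_id.split("_")
start = datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc)
print("RM start time:", start, "- application sequence:", int(seq))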
Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.
To focus only on S3 writes, I am going to use a very simple SQL INSERT statement that just moves data from one table into another without any transformations:
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT record_id, event_timestamp, event_name, app_name, country, city, payload
FROM events_raw;
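One way to observe the resulting S3 write activity is to read the S3 request metrics from CloudWatch. A sketch (it assumes a request metrics filter named EntireBucket is enabled on a hypothetical bucket my-events-bucket):

import boto3
from datetime import datetime, timedelta, timezone

# Count S3 PUT requests over the last hour using S3 request metrics in
# CloudWatch. Requires a request metrics filter on the bucket; the bucket
# and filter names below are assumptions.
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="PutRequests",
    Dimensions=[{"Name": "BucketName", "Value": "my-events-bucket"},
                {"Name": "FilterId", "Value": "EntireBucket"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
for p in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(p["Timestamp"], int(p["Sum"]))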
Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files that have small stripes.
Note. All experiments below were executed on Hive 2.1.1 in Amazon EMR. This article does not apply to Qubole running on AWS, as Qubole uses a different algorithm to define the number of map tasks for ORC files.
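Before looking at how Tez groups splits, it helps to see the stripe layout of the input files. Hive ships an ORC file dump utility for this; below is a small wrapper sketch (the file path is a placeholder):

import subprocess

# Inspect the stripe layout of an ORC file with Hive's orcfiledump
# utility; the file path is a placeholder assumption.
out = subprocess.run(
    ["hive", "--orcfiledump", "/warehouse/events/part-00000.orc"],
    capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if line.lstrip().startswith("Stripe"):  # stripe offsets, row counts, sizes
        print(line)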