Hadoop,  JVM,  Memory,  YARN

Hadoop YARN – Container Virtual Memory – Understanding and Solving “Container is running beyond virtual memory limits” Errors

In the previous article about YARN container memory (see Tez Memory Tuning – Container is Running Beyond Physical Memory Limits) I wrote about physical memory. Now I would like to pay attention to virtual memory in YARN.

A typical YARN memory error may look like this:

Container is running beyond virtual memory limits. Current usage: 1.0 GB of 1.1 GB physical memory used; 2.9 GB of 2.4 GB virtual memory used. Killing container.

So what is virtual memory, how do you solve such errors, and why is the virtual memory size often so large?

Let’s find a YARN container and investigate its memory usage:

$ ssh -i private_key hadoop@10.x.x.x

$ sudo jps -l -v

45117 org.apache.tez.runtime.task.TezChild -Xmx1152m ...
...

Process 45117 is a YARN container running a Tez task (a task of an Apache Hive query in my case). Using the top command we can check its virtual memory usage:

$ sudo top -p 45117

 PID    USER  PR  NI  VIRT     RES    SHR  S  %CPU   %MEM  TIME+    COMMAND
 45117  yarn  20   0  3010m    832m   41m  S  157.7  0.7   0:31.38  java

You can see that the process virtual memory is 3.0 GB, while the Java process was launched with a maximum heap size of 1152 MB (-Xmx1152m). So what takes up the other ~2 GB?

Using the pmap command we can see details of the memory map for the process:

$ sudo pmap 45117
...
0000000001ec4000  19432K rw---    [ anon ]
00000000b8000000 670208K rw---    [ anon ]
00000000e0e80000 116224K -----    [ anon ]
00000000e8000000 392704K rw---    [ anon ]
00000000fff80000    512K -----    [ anon ]
0000000100000000   4784K rw---    [ anon ]
00000001004ac000 1043792K -----   [ anon ]
00007fa01d121000    512K rw---    [ anon ]
00007fa01d1a1000   1536K -----    [ anon ]
00007fa01d321000     20K r-x--  /usr/lib/hadoop/lib/native/libsnappy.so.1.1.3
00007fa01d326000   2044K -----  /usr/lib/hadoop/lib/native/libsnappy.so.1.1.3
...
00007fa02218e000     40K r--s-  /usr/lib/hadoop-yarn/lib/jersey-core-1.9.jar
00007fa022198000     20K r--s-  /usr/lib/hadoop-yarn/lib/commons-lang-2.6.jar
00007fa02219d000     76K r--s-  /usr/lib/hadoop-yarn/lib/zookeeper-3.4.10.jar
00007fa0221b0000     12K r--s-  /usr/lib/hadoop-yarn/lib/jsr305-3.0.0.jar
00007fa0221b3000      8K r--s-  /usr/lib/hadoop-yarn/lib/stax-api-1.0-2.jar
...
00007fa022acc000     16K r--s-  /usr/lib/hadoop/lib/commons-io-2.4.jar
00007fa022ad0000     36K r--s-  /usr/lib/hadoop/lib/jets3t-0.9.0.jar
00007fa022ad9000     20K r--s-  /usr/lib/hadoop/lib/commons-net-3.1.jar
00007fa022ade000      8K r--s-  /usr/lib/hadoop/lib/commons-codec-1.4.jar
00007fa022ae0000      8K r--s-  /usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar
00007fa022ae2000      8K r--s-  /usr/lib/hadoop/lib/curator-client-2.7.1.jar
00007fa022ae4000     24K r--s-  /usr/lib/hadoop/lib/commons-httpclient-3.1.jar
...
00007fa023ab7000     12K -----    [ anon ]
00007fa023aba000   1016K rw---    [ anon ]
00007fa023bb8000     12K -----    [ anon ]
00007fa023bbb000   1016K rw---    [ anon ]
00007fa023cb9000     12K -----    [ anon ]
00007fa023cbc000   1016K rw---    [ anon ]
00007fa023dba000     12K -----    [ anon ]
00007fa023dbd000   1016K rw---    [ anon ]
...
00007fa04d9e3000     28K r-x--  /lib64/librt-2.17.so
00007fa04d9ea000   2044K -----  /lib64/librt-2.17.so
00007fa04dbe9000      4K r----  /lib64/librt-2.17.so
00007fa04dbea000      4K rw---  /lib64/librt-2.17.so
00007fa04dbeb000     84K r-x--  /lib64/libgcc_s-4.8.3-20140911.so.1
00007fa04dc00000   2048K -----  /lib64/libgcc_s-4.8.3-20140911.so.1
00007fa04de00000      4K rw---  /lib64/libgcc_s-4.8.3-20140911.so.1
00007fa04de01000   1028K r-x--  /lib64/libm-2.17.so
00007fa04df02000   2044K -----  /lib64/libm-2.17.so
00007fa04e101000      4K r----  /lib64/libm-2.17.so
00007fa04e102000      4K rw---  /lib64/libm-2.17.so
00007fa04e103000    920K r-x--  /usr/lib64/libstdc++.so.6.0.19
00007fa04e1e9000   2044K -----  /usr/lib64/libstdc++.so.6.0.19
...
00007fa050144000      4K rw---    [ anon ]
00007ffd9b59b000    136K rw---    [ stack ]
00007ffd9b5cc000      8K r----    [ anon ]
00007ffd9b5ce000      8K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total          3083100K

Although the JVM does not immediately allocate the maximum heap size (-Xmx) specified for the process, it reserves the full amount (1152 MB in my example) in virtual memory.
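A quick way to confirm this on the node is jstat (assuming the JDK tools are installed there, as they are for jps above):

$ sudo jstat -gccapacity 45117

The maximum generation capacities reported by this command (the NGCMX and OGCMX columns, in KB) add up to roughly the -Xmx value, while the current capacities (NGC, OGC) are typically much smaller.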

Besides the JVM heap areas (marked as [ anon ] in the output above) and various other I/O and system areas, there are many .so shared libraries and .jar files mapped into the virtual address space of the process. In my case there are about 200 .so and 400 .jar files, which is why the virtual memory takes ~3 GB.
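You can get a similar breakdown yourself by post-processing the pmap output, for example (a rough sketch — each library and JAR usually appears in several mappings, so these are mapping counts rather than distinct files):

$ sudo pmap 45117 | grep -c '\.so'
$ sudo pmap 45117 | grep -c '\.jar'

$ sudo pmap 45117 | awk '/\[ anon \]/ {anon += $2} /\.so/ {so += $2} /\.jar/ {jar += $2} END {printf "anon: %d K, .so: %d K, .jar: %d K\n", anon, so, jar}'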

In YARN, the yarn.nodemanager.vmem-pmem-ratio option defines how much virtual memory a container is allowed to use per unit of its physical memory allocation, and it is set to 2.1 by default (the 2.4 GB virtual memory limit in the error above comes from multiplying the container's physical memory allocation by this ratio). If you allocate relatively small containers of ~1 GB, this ratio can be too low and you may often face "Container is running beyond virtual memory limits" errors.

It is recommended to set this ratio to a higher value, for example 5, since the virtual address space of a YARN container may be crowded with a large number of .so and .jar files.
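For example, you can add the following to yarn-site.xml on the NodeManager hosts and restart the NodeManagers (a sketch using the value 5 suggested above):

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>5</value>
</property>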

Another, less recommended, solution is to disable the virtual memory check altogether by setting yarn.nodemanager.vmem-check-enabled to false.
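If you do go this route, the corresponding yarn-site.xml entry on the NodeManager hosts looks like this:

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>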