Hive container is running beyond physical memory limits

Categories: BigData

Overview

Recently I used the Hive interactive commandline tool to run a simple query against a very large table. The query failed (after some time) with:

Status: Failed
------------------------------------------------------------------
Application application_012345_0123 failed 2 times due to AM Container for appattempt_0123455_0123_00002 exited with  exitCode: -104
For more detailed output, check the application tracking page: https://somehost:someport/cluster/app/application_012345_0123 Then click on links to logs of each attempt.
Diagnostics: Container [pid=12345,containerID=container_e12_012345_0123_02_000001] is running beyond physical memory limits. Current usage: 4.0 GB of 4 GB physical memory used; 5.8 GB of 8.4 GB virtual memory used. Killing container.
Dump of the process-tree for container_e12_012345_0123_02_000001 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 12345 67890 1111 2222 (bash) 0 0 112233444 697 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java
          -Djava.io.tmpdir=/path/to/USER/appcache/application_012345_0123/container_e12_012345_0123_02_000001/tmp
          -server
          -Djava.net.preferIPv4Stack=true -Dhdp.version=2.5.3.0-37
          -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
          -Xmx3686m
          -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties
          -Dyarn.app.container.log.dir=/path/to/log/application_012345_0123/container_e12_012345_0123_02_000001
          -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster --session
          1>/path/to/log/application_012345/container_e12_012345/stdout
          2>/path/to/log/application_012345_0123/container_e12_012345_0123_02_000001/stderr  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Application application_012345_0123 failed 2 times
  due to AM Container for appattempt_012345_0123_000002 exited with  exitCode: -104

Unfortunately, an internet search found many pages referencing similar errors, but no clear answers; some people found options that solved the problem for them, but nobody seemed to know why those options worked. After some research, here are my conclusions - a mix of research and guesswork. Corrections are welcome!

This page applies specifically to Hive 1.3.1 using Tez on Yarn (Hortonworks Data Platform 2.5.3). The principles may be useful for similar environments.

The Causes

As far as I can see, this kind of memory problem can occur in two places:

  • The Tez application master program (ie driver process which coordinates tasks)
  • A Tez task (ie a process running logic to actually read and transform input)

It can also be caused by incorrect default Tez settings for the cluster - in particular, when sysadmins explicitly include a “-Xmx” option in the default JVM args used to start Tez processes (see later) and set it too high relative to the default container size, so that the JVM can grow beyond the size allowed for the container.

Application Master

The application master is the most common cause of this problem, triggered by a job which has a very large number of tasks. The set of tasks must be computed and held in memory, each task’s status tracked, and its results stored; lots of tasks therefore require lots of memory.

By default each input file and each HDFS block within each file is a data partition, and a task is created for each partition. The number of tasks therefore depends primarily upon:

  • The number of distinct input files
  • The number of blocks in the input

A complicated query may make things slightly worse, but the number of tasks following a shuffle is fixed by the “bucketing factor” rather than by the number of partitions in the input - so the number of input partitions, rather than query complexity, is the most significant factor in the total number of tasks.

Resolving the “exceeds physical memory” problem therefore requires either:

  • Reducing the number of input files,
  • Allocating multiple HDFS blocks to each task, or
  • Increasing the amount of memory for the application master process

When an unpartitioned table consists of many small files, then the Hive command alter table {tablename} concatenate will combine smaller files into larger ones. It doesn’t reduce the table to a single file; by default each output file is one HDFS block in size. For a partitioned table, the concatenate command must be of form alter table {tablename} partition(field=val, field=val) concatenate. Unfortunately the partitions must be specified explicitly - no expressions or wildcards allowed. When a table has many partitions, the current recommended solution is to write a script to perform the concatenation.
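To illustrate, here is roughly what that looks like. The table name web_logs and its partition columns are invented for this example, and string-valued partitions would need their values quoted:

  -- unpartitioned table: combine small files into (roughly) block-sized files
  alter table web_logs concatenate;

  -- partitioned table: each partition must be named explicitly
  alter table web_logs partition (year=2016, month=12) concatenate;

  -- one way to script it for many partitions: generate a statement per partition
  -- from "show partitions", whose output lines look like "year=2016/month=12"
  hive -e "show partitions web_logs" \
    | sed 's/^/alter table web_logs partition (/; s|/|, |g; s/$/) concatenate;/' \
    > concat.hql
  hive -f concat.hql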

When a table consists of some very large (multi-block) files then it may help to adjust the “stride”, ie the split grouping, so that a single task reads a range of data covering multiple blocks. I haven’t tested this myself; a possible approach is sketched below - please correct it in the comments if you know better.
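My best guess (untested - treat the option names and values as an assumption rather than a verified recipe) is that the Tez split-grouping limits are the relevant knob; raising them should make each task cover more data:

  -- untested guess: raise the Tez split-grouping limits so each task covers more data
  set tez.grouping.min-size=268435456;    -- 256 MB minimum per grouped split
  set tez.grouping.max-size=1073741824;   -- 1 GB maximum per grouped split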

The easiest solution is of course just to allocate more memory to the application master process - see later.

Tez Tasks

There are reports on the internet of this problem occurring in the “task host” processes (rather than the application master) - presumably some complicated queries can consume significant amounts of memory within tasks.

Solutions are:

  • To simplify the query,
  • To use more tasks (and therefore less data per task), or
  • To allocate more memory to the Tez processes running tasks

Memory Allocation Principles

At runtime, Tez uses a process model similar to Spark’s, and different from traditional MapReduce. A single container is started via Yarn to run the application master process. This then requests N additional containers, and in each container a generic “task host” process (a JVM) is started. The application master then executes tasks by sending a message to a task host process, which creates a thread and runs the task.

Each Yarn node-manager process has a “memory resource quota” set on startup via per-host config option yarn.nodemanager.resource.memory-mb (not to be confused with yarn.scheduler.maximum-allocation-mb, which caps how much memory a single container request may ask for). This quota isn’t enforced by the operating system in any way - the yarn node manager doesn’t “reserve” the memory, it just uses the value as a guide to decide whether it can reasonably accept new container allocation requests. The value should be the physical amount of memory on the node minus some margin for the operating system itself and any other processes that might run on the same host.

When a Yarn client application or application master requests a new container from Yarn, it includes a “container memory size”. The container will only be allocated on a yarn node which has at least this much “free space” in its quota, and the free-space value on that node is then decremented by that amount. In addition, config option yarn.scheduler.minimum-allocation-mb sets the “base unit” for memory allocation; any container request whose memory size is not a multiple of this value is rounded up to the next multiple.
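For example (values invented for illustration):

  yarn.scheduler.minimum-allocation-mb = 1024
  requested container size             = 4500 MB
  allocated container size             = 5 * 1024 = 5120 MB  (rounded up to the next multiple)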

Note that a “container” is not necessarily anything to do with “linux containers” or “docker containers” - it is a Yarn “logical resource reservation”.

Eventually the “owner” of the container sends a command-line string to the node-manager to start a process within that container (usually a java command). Yarn does not itself enforce any hard memory limit on the started process; when running on a fairly modern Linux it could potentially use cgroups to do this enforcement, but it currently does not - and for other platforms it has no way to enforce such limits at all. Instead, when monitoring is enabled, the node-manager periodically queries the OS to see how much memory the created process is using, and kills it if it has exceeded the size of the container (ie the originally requested size rounded up to a multiple of the base unit of memory). This monitoring is not 100% reliable - if a started process forks then the node-manager can lose track of which processes belong to the container and cannot track/kill them. However this is not a problem when using MapReduce, Tez, or Spark - the processes those frameworks generate don’t fork.

When the process started in the container is a JVM (and it always is for MapReduce/Tez/Spark) then the JVM itself enforces a memory limit - JVM commandline parameter -Xmx specifies how much memory may be used for the user heap. If a Java application within that JVM allocates memory, and there is no more space on the current heap, and the -Xmx limit has been reached then the JVM throws an OutOfMemoryError. An application can catch this, but normally does not - in which case the JVM terminates.

The JVM will use more memory than the -Xmx limit: it also needs space for thread stacks, loaded classes, and the JVM’s own implementation. In addition, in recent JVMs applications can allocate “off-heap” memory buffers which are not restricted by the -Xmx limit.

One thing that initially confused me is that recommendations from the internet set the container-size config options (eg mapreduce.map.memory.mb for mapreduce) to a larger value than the -Xmx value. Hopefully the above description makes the reason clear: the container size enforced by the node-manager needs to be large enough to hold the entire JVM in memory, not just the Java heap (-Xmx).
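For example, a typical sizing (the numbers are illustrative, not a recommendation) keeps the heap noticeably smaller than the container:

  set mapreduce.map.memory.mb=4096;       -- container size checked by the node-manager
  set mapreduce.map.java.opts=-Xmx3276m;  -- Java heap limit, roughly 80% of the container,
                                          -- leaving headroom for stacks, classes and JVM overhead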

All of the above applies to both “task hosts” and the application master process.

Tez Sessions

One additional complication is that Tez supports “sessions”. An interactive tool such as the hive cli, beeline, or a web-based tool such as zeppelin starts a Tez session, which starts and holds onto one or more containers that are reused across multiple queries. This makes interactive responses much quicker (no waiting for Yarn to allocate containers for each query), but it does mean that changing config settings within the interactive session does not affect the already-created containers - only new ones.

Solutions to the Memory Issues

When using Hive-cli with Tez, the size requested for the containers in which “tez task host” processes are started is set by a config option (discussed below). This size limits the overall amount of memory that a yarn node-manager will allow such a process to use; as noted earlier, the node-manager periodically checks the size of the process running in the container and kills it if it exceeds the container’s requested size.

As Tez tasks only use on-heap memory, the -Xmx option used to start these processes is also critical - increasing the container size will not help when the JVM itself cannot expand.

Below is a description of how to change the container size, and the JVM heap size.

Options for node manager monitoring

These options cannot be set by users; they are listed here just for reference:

  • yarn.nodemanager.pmem-check-enabled – enables periodic monitoring of the resident-set-size memory usage (physical memory) of the process running within a container (default:true)
  • yarn.nodemanager.vmem-check-enabled – as above, but monitors virtual memory usage, ie the process’s full virtual address space, including swapped-out data (default: false)
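If curious, you can ask the hive cli what values it sees for these (note this shows the client’s copy of the Hadoop config, not necessarily what each node-manager was actually started with):

  set yarn.nodemanager.pmem-check-enabled;
  set yarn.nodemanager.vmem-check-enabled;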

Options for container size control

Now comes the complicated part - there are various overlapping and very poorly documented options for setting the size of Tez containers.

According to some links, the following options control how Tez jobs started by Hive behave:

  • hive.tez.container.size – value in megabytes
  • hive.tez.java.opts

and that when these are not set, Hive falls back to the following:

  • mapreduce.map.memory.mb
  • mapreduce.map.java.opts

However other documentation indicates that the following options control how Tez jobs behave:

  • tez.task.resource.memory.mb – requested memory for Yarn containers running “task host” processes
  • tez.task.launch.cmd-opts – args for JVM instance started in above containers
  • tez.am.resource.memory.mb – requested memory for the Yarn container running an application master process
  • tez.am.launch.cmd-opts – args for JVM instance started in above container
  • tez.container.max.java.heap.fraction (default: 0.8) - see below

In my particular environment (using hive cli tool with Hortonworks Data Platform 2.5.3), the latter options (pure tez ones) were relevant. Setting these in the hive cli before starting the query allowed it to run. I am not sure what effect the hive.tez.* options have, or why they exist.

UPDATE: shortly after writing this article, I triggered another kind of OutOfMemory error in Hive. That time, modifying hive.tez.container.size did have an effect - it let the query run long enough to at least log the real cause of the problem. Clearly there is some interesting (and apparently undocumented) interaction between the hive.tez.* and tez.* config params.

What is elegant about the tez.*.launch.cmd-opts options is that when -Xmx and -Xms are not specified, Tez calculates them as tez.container.max.java.heap.fraction * tez.*.resource.memory.mb. This means that as a user there is just one value to configure (the resource.memory.mb) rather than also needing to set the java opts. Tez java code does not allocate off-heap memory, so using a ratio to compute heap space is fairly reliable.
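As a worked example (numbers invented):

  tez.am.resource.memory.mb            = 4096
  tez.container.max.java.heap.fraction = 0.8
  calculated AM heap option            = -Xmx3276m  (roughly 0.8 * 4096)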

Under HDP (at least) there is a configuration-file /etc/tez/conf/tez-site.xml which sets defaults for the above. These values are visible when running set -v from within the hive cli.
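For example, from within the hive cli:

  set -v;                           -- dumps every effective config value, including the tez.* defaults
  set tez.am.resource.memory.mb;    -- or query one specific value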

MapReduce configuration options are split into map-specific and reduce-specific variants. Tez does not have pure map or reduce phases; “task host” containers can run either type of task.

Config option tez.queue.name can also be useful; it controls which Yarn workqueue a job is submitted to (rather than inheriting a default from the hive cli). According to one of the links in the references section, setting this value causes queries to be launched in new yarn containers rather than reusing containers belonging to the current Tez session.

Options for JVM option control

For my specific case (simple query against very large table), the following allowed the query to complete:

hive
  --hiveconf queue=myqueue
  --hiveconf tez.am.resource.memory.mb=16394
  --hiveconf tez.am.launch.cmd-opts=""

Overriding tez.am.launch.cmd-opts was necessary because the sysadmins for the cluster had incorrectly included an explicit -Xmx3686m in the default value (in file /etc/tez/conf/tez-site.xml). Removing this value allowed Tez to calculate it automatically via tez.am.resource.memory.mb * tez.container.max.java.heap.fraction.

I used --hiveconf rather than starting the hive cli and then using set because I was concerned that the job would run in containers belonging to the existing Tez session, which would have been set up using the defaults rather than the explicitly-set values. Maybe this would not have been a problem, but setting the config before starting the hive cli seemed safer.

Just as a last note: the hive cli tool is supposedly deprecated in favour of the beeline tool. However at the current time (at least with HDP 2.5.3) beeline is far inferior to the hive cli, and rather buggy.

Last Notes

As mentioned at the start, the content here is based on fragmentary information from the internet, experiments, and guesswork. If you have additional information or corrections, please let me know!

References