Apache Yarn Overview

Categories: BigData

Introduction

This is a discussion of the Yarn component of the Apache Hadoop “big data” project.

Yarn is a system for distributing processes through a cluster, ie running an application on some arbitrary node in a cluster of servers. A client application builds an application context consisting of a set of arbitrary files (“resources”), a set of constraints on the kind of node that should be chosen, and then a list of commands to execute. The application context is submitted to the primary yarn server (“resource manager”) which then chooses a suitable node from the available cluster, unpacks the resources there, and executes the specified commands. Typically the resources include files that the commands “run”; they might be bash scripts, python scripts, java jars, or potentially native code.

Yarn Installation

Yarn consists of the following components:

  • resource-manager daemon
  • node-manager daemon
  • client libraries

A typical Yarn cluster consists of a series of racks in a datacenter, each with multiple “blade” servers running Linux (either booted from a local disk, or from an OS image on network-accessible storage).

One of the servers runs the resource-manager daemon, and all others run the node-manager daemon. For high-availability, it is possible to run “standby” instances of the resource-manager daemon on a few additional nodes, but only one resource-manager is considered “active” at any time.

Each “worker node” (running the node-manager daemon) is configured with the network address of the resource-manager, and registers with it on startup.

As the yarn daemons are implemented in Java, each server also needs a suitable JVM installed. Otherwise, there is little unusual about the physical servers or their configuration - just a reasonable CPU and adequate amounts of RAM.

The resource-manager daemon listens on a TCP port; the worker nodes and client applications connect to it and communicate using Hadoop RPC, with messages encoded in Protocol Buffers format. Because Protocol Buffers is language-independent (unlike Java’s native RPC for example), clients can theoretically be implemented in any language. However the message format is not officially “stable”, and the Hadoop project only provides client libraries in Java.

Submitting an Application Context

A client application can build a request to execute an application, and submit it to the Yarn resource-manager. The request is sent over a normal TCP socket in Protocol Buffers format, so theoretically such a request can be built in any language with a Protocol Buffers library. However the message format is not officially stable, and Hadoop provides only a Java client library. It is therefore best to use that library to build requests and pass them to the Yarn resource-manager.

The first step for a client is to connect to the Yarn resource-manager and request a (unique) application-id. Once this is returned, the client (usually via the client library API) builds a datastructure with the following components:

  • the application-id (as allocated by the resource-manager)
  • an application name (not necessarily unique)
  • a work-queue name
  • a set of resource files (such as configfiles, scripts, jarfiles, or executable binaries)
  • a set of node constraints
  • a set of flags such as whether to auto-restart the application on failure
  • and finally, a list of commands to execute

This datastructure is then sent to the Yarn resource-manager, which places the request on the specified work-queue.
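The datastructure described above can be pictured as a plain record. The following Python sketch is only an illustration: the class name, field names and the example id are hypothetical, and do not match the real Yarn client API (which is Java).

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationContext:
    """Illustrative model of a submission request; all names are hypothetical."""
    application_id: str                               # allocated by the resource-manager
    name: str                                         # not necessarily unique
    queue: str = "default"                            # work-queue to place the request on
    resources: dict = field(default_factory=dict)     # file name -> URL or local path
    constraints: dict = field(default_factory=dict)   # e.g. {"memory_mb": 1024}
    restart_on_failure: bool = False                  # one of the optional flags
    commands: list = field(default_factory=list)      # executed on the chosen node


ctx = ApplicationContext(
    application_id="application_0001",                # made-up id for the example
    name="word-count",
    resources={"myscript.py": "hdfs:///apps/myscript.py"},
    constraints={"memory_mb": 1024, "vcores": 1},
    commands=["/usr/bin/python3 myscript.py"],
)
```

Everything the chosen node needs travels inside this one record, which is why the resource-manager can hand it to any suitable node.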

Any client application can connect to the yarn resource-manager and ask it for information about an application-context, searching either by application-id or by application-name. The original submitter of course has the application-id, and should use it. Other applications can use the application-name string, but must be aware that there could potentially be multiple applications using that name. The status initially shows the job as “waiting”; once the job has started execution on some node in the cluster, additional information is available - including custom data that the application can itself register. Common custom status information includes the (IP/Port) address on which the application is listening, and counters for various events occurring in the application. Once the application has terminated, this information and the termination status are also available via the resource-manager.
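The difference between the two lookups can be shown with a toy registry. This is a sketch under assumed names (StatusRegistry and its methods are invented for illustration); it only demonstrates that an id returns at most one record while a name may return several.

```python
class StatusRegistry:
    """Toy lookup table: ids are unique, names may be shared."""

    def __init__(self):
        self.by_id = {}

    def register(self, app_id, name, state="waiting", **custom):
        # Custom keyword arguments stand in for application-registered data.
        self.by_id[app_id] = {"id": app_id, "name": name, "state": state, **custom}

    def find_by_id(self, app_id):
        """Return the single matching record, or None."""
        return self.by_id.get(app_id)

    def find_by_name(self, name):
        """Return every record with this name; the caller must handle several."""
        return [app for app in self.by_id.values() if app["name"] == name]


registry = StatusRegistry()
registry.register("app_1", "web-frontend", state="running", address="10.0.0.5:8080")
registry.register("app_2", "web-frontend")
```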

The resource-manager keeps hold of the submitted application-context until the application has successfully terminated. If it terminates unexpectedly then yarn can optionally restart it.

Yarn provides a simple “command line client” which is just a wrapper over the standard libraries; command “yarn jar …” can be used to submit a suitable java application:

  • an “application-id” is auto-allocated and printed as output of “yarn jar”
  • its “application name” is ??
  • it is placed on the default queue, and run on any node
  • the associated “command” is simply “java -jar {theJarName}”

The Yarn resource-manager daemon provides (since v2.5.0) a REST API for querying status information and submitting new application requests. The input/output request formats are actually a pretty good guide to the options available via the programmatic API too; see “Cluster New Application” and “Submit Application” in the ResourceManager REST API documentation.
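As a rough guide to what a “Submit Application” body looks like, the sketch below builds one as a Python dict without sending anything over the network. The field names follow my reading of the REST documentation and should be checked against the Hadoop version in use; the function name, the example id and the default values are assumptions made for this example.

```python
import json


def build_submit_body(app_id, name, command, queue="default", memory_mb=1024, vcores=1):
    """Build the JSON body for a 'Submit Application' request (no network I/O)."""
    return {
        "application-id": app_id,          # obtained beforehand via 'Cluster New Application'
        "application-name": name,
        "queue": queue,
        "am-container-spec": {"commands": {"command": command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
        "application-type": "YARN",
    }


body = build_submit_body("application_0001", "demo", "/bin/sh myscript.sh")
print(json.dumps(body, indent=2))
```

The shape mirrors the list given earlier: an id, a name, a queue, resource requirements, and the command to execute.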

Scheduling

Yarn includes a scheduler which continually monitors the state of its available worker nodes, looking for nodes which are able to run jobs waiting on its work-queues. The process of matching waiting jobs with available nodes is not trivial, and there are many configurable options for tuning the system, such as assigning certain queues higher priorities than others or ensuring that each work-queue gets a fair time-share of the available cluster capacity. Scheduling won’t be discussed here in detail; it will be assumed that the cluster has enough capacity to immediately run submitted jobs.

The application context specifies a set of “node constraints” which the scheduler takes into account. Examples are:

  • run only on the worker node with a specific id
  • run near the worker node with a specific id
  • run only on a worker node with a specific label
  • run only on a node with at least N GB of free memory
  • run only on a node with at least N free CPUs
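How such constraints narrow down the set of usable nodes can be sketched as a simple filter. This is not the Yarn scheduler: the Node class and the constraint keys are hypothetical, and the “run near” constraint is left out because it is a preference rather than a hard requirement.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    free_memory_gb: int
    free_cpus: int
    labels: set = field(default_factory=set)


def satisfies(node, constraints):
    """Return True if the node meets every hard constraint in the dict."""
    if "node_id" in constraints and node.node_id != constraints["node_id"]:
        return False
    if "label" in constraints and constraints["label"] not in node.labels:
        return False
    if node.free_memory_gb < constraints.get("min_memory_gb", 0):
        return False
    if node.free_cpus < constraints.get("min_cpus", 0):
        return False
    return True


def candidates(nodes, constraints):
    """Filter the cluster down to the nodes able to run the job."""
    return [node for node in nodes if satisfies(node, constraints)]


cluster = [
    Node("node1", free_memory_gb=4, free_cpus=2),
    Node("node2", free_memory_gb=16, free_cpus=8, labels={"gpu"}),
]
big_enough = candidates(cluster, {"min_memory_gb": 8})
```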

Executing a Queued Job

Once the scheduler has found a suitable node, the resource-manager sends the entire application-context over the network to the chosen node.

The nodemanager first creates a temporary directory for the new application, and unpacks all the resource files in the context into this directory. Some of the resources may be URLs (and in particular, references to files in HDFS) in which case the nodemanager will automatically download that file into the temporary directory. Distributing such files via HDFS is a very efficient way to ensure code running in Yarn has quick access to the resources it needs; tools such as the Hadoop MapReduce library take advantage of this feature.

The nodemanager then executes each of the specified commands, as separate processes. Examples of commands are:

  • /bin/sh myscript.sh
  • /usr/bin/python3 myscript.py
  • ${JAVA_HOME}/bin/java -classpath myapp.jar:mylib1.jar:mylib2.jar {mainClass}

where the specified scripts/jarfiles were included in the application-context as resource-files. This process does make some assumptions about the worker node, eg that “/bin/sh” exists, or python or java are preinstalled.
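The unpack-then-execute sequence can be imitated locally in a few lines. This is a minimal sketch, assuming resources are given as in-memory strings rather than downloaded files; run_context is a hypothetical helper, not part of Yarn.

```python
import os
import shlex
import subprocess
import sys
import tempfile


def run_context(resources, commands, env=None):
    """Unpack resources into a fresh directory, then run each command there.

    resources: dict of file name -> file content (str)
    commands:  list of command lines, each run as a separate process
    Returns the list of exit codes, one per command.
    """
    exit_codes = []
    with tempfile.TemporaryDirectory() as workdir:
        for name, content in resources.items():
            with open(os.path.join(workdir, name), "w") as handle:
                handle.write(content)
        for command in commands:
            proc = subprocess.run(
                shlex.split(command),
                cwd=workdir,                           # commands see the unpacked files
                env={**os.environ, **(env or {})},     # context variables override defaults
            )
            exit_codes.append(proc.returncode)
    return exit_codes


codes = run_context(
    {"myscript.py": "import sys; sys.exit(3)"},
    [f"{shlex.quote(sys.executable)} myscript.py"],
)
```

The example uses the current Python interpreter instead of a fixed path, which sidesteps the assumption about what is preinstalled on the node.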

Isolating Executed Jobs

The environment in which a job is executed is called a “container”. However this is just a generic term; the actual mechanisms available are listed below.

In its simplest mode, a nodemanager daemon just performs a fork, changes current-working-directory to the temporary dir created for the application context and executes the commands as the same user the nodemanager daemon is running as. Of course this approach is vulnerable to accidents or malicious behaviour. The nodemanager is normally installed/started using a system account “yarn”, which does limit the danger somewhat but a malicious job can certainly interfere with other jobs running concurrently on the same node, and with the nodemanager itself.

An option on some operating systems (eg Linux) is to run all application-contexts in a single dedicated user account, eg the standard user “nobody” or a user specially defined for this purpose. This makes it more difficult for an application to interfere with the nodemanager itself, but concurrent jobs can still interact in undesired ways.

When the cluster is configured to use Kerberos, each request to the resource-manager is accompanied by a Kerberos ticket proving the identity of the job submitter. It is then possible to configure nodemanagers to run each submitted job as the Linux user account belonging to the submitter. This requires the relevant user-accounts to be defined on every server running the nodemanager daemon (or at least on all nodes that the application context constraints might select).

To achieve this “run as other user” behaviour on Linux, the Hadoop install includes a Linux binary application $HADOOP_HOME/bin/container-executor. This is initially installed with owner=yarn; enabling “run as other user” requires changing its ownership to “root” and setting the suid-bit. Yes, this has significant security implications - I’m not sure what measures the application takes to limit the commands it can carry out. The nodemanager configuration supports “whitelists” and “blacklists” to limit the accounts that it will run containers as. In particular, a minimum-user-id can be defined (default:1000) to ensure yarn applications are never run as system-user accounts. Of course, no request with a valid Kerberos ticket for such an account should ever be received.
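The account checks described above amount to a small decision function. The sketch below is an assumption about how such checks might be combined (blacklist first, then whitelist, then the minimum user id); the function and parameter names are hypothetical and the real container-executor may order or name things differently.

```python
def may_run_as(user, uid, min_user_id=1000, banned=(), allowed_system_users=()):
    """Decide whether a container may be launched under the given account."""
    if user in banned:
        return False                 # blacklisted accounts are always refused
    if user in allowed_system_users:
        return True                  # whitelisted accounts bypass the uid floor
    return uid >= min_user_id        # everyone else must be a non-system account
```

For example, `may_run_as("alice", 1500)` is accepted, while `may_run_as("daemon", 2)` is refused because the uid is below the default floor of 1000.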

On Linux, the nodemanager can be configured to use Linux Containers to create a (mostly) isolated environment for each application-context. The executed commands run with a limited filesystem, and optionally with CPU and RAM quotas which actually enforce the settings in the application request (without containers, these are just voluntary limits).

On Linux there is also the option to start each application-context in a docker container. The container image is specified in the nodemanager config, not in the application-context, ie applications cannot choose which image to be run in. Nevertheless this is a very nice way to provide a stable and predictable runtime environment for jobs in a Yarn cluster, regardless of the configuration of the OS on which the nodemanager itself is running.

Warning: Linux containers still have significant security issues. Using one is more secure than running as the nodemanager user, but less secure than running as another user account.

Warning: with the current state of Docker, the Docker approach is not as secure as running as a separate local user. Linux containers have only recently gained “user namespace” support in which root user privileges within the container can be mapped to non-root privileges outside the container. Docker has (at the current time) not gained support for this feature; it is therefore not too difficult for a malicious application within a docker environment to obtain elevated privileges within the host operating system.

Of course security is not always a major concern when using clusters for “big data” processing; the code (application contexts) being run in such environments is often from “trustworthy” sources.

Environment Variables

The original Application Context can define environment-variables that should be set before invoking the commands. A Yarn nodemanager also sets a bunch of standard variables that executed applications can rely on, including $JAVA_HOME and a $CLASSPATH which includes common Hadoop libraries.

Reusing Jobs

An application context is also sometimes referred to as a “container” - and it can reasonably be compared to something like a Docker container, as it is a set of resources providing an environment in which to execute some command.

It is common for the “same task” to be executed many times, and thus Yarn makes some attempts to cache contexts on worker nodes, in case the same job is executed again on the same node in the near future.

Heartbeats and Status Reports

The application executed by a nodemanager should immediately connect back to the resource-manager and report its status; environment variables are provided by the nodemanager with useful data such as the resource-manager address. For long-running programs, regular status reports (heartbeats) should be sent. These messages are in Protocol Buffers format, as with all communication to the resource-manager.

This channel from executing job to resource-manager can also be used by the started application to queue additional jobs; MapReduce uses this extensively. Yarn then recognises the parent/child relation and handles retries/notifications appropriately. When using this design-pattern, the single job submitted by the original client process is referred to as an “Application Master”, and the subjobs it spawns are referred to as “tasks”; the Application Master then manages and monitors the tasks it spawns, and only terminates when all of its child tasks have completed.

Application Masters and Jobs

An application started by making a request from a client to the Yarn resource-manager is called an “Application Master”, or AM.

An AM must connect back to the Yarn resource-manager to report status. It may use the same communication channel to request that additional “containers” (execution environments) be reserved for it in the Yarn cluster; these requests may specify the same sort of constraints that a client application specifies when requesting the launch of an ApplicationMaster, eg WorkQueue, memory-size, cpu-type, and various other attributes. If the request can be satisfied, then the resource-manager returns the id of the yarn nodemanager that can provide a suitable container. The AM may then communicate with that nodemanager to specify that specific resources be unpacked into that environment and specific commands be executed - similar to how the AM itself was launched. These “subtasks” do not have their own unique “application name” and status information within the Yarn resource-manager; an AM is responsible for managing code executing in the containers it allocates.
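The request-then-launch cycle can be simulated with a stand-in resource-manager. This is a sketch only: FakeResourceManager, run_application_master and the launch callback are hypothetical names, and the fake allocator ignores constraints and simply hands out nodes in rotation.

```python
class FakeResourceManager:
    """Stand-in for the resource-manager; hands out node ids round-robin."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.next_index = 0

    def allocate(self, constraints):
        # A real scheduler would match the constraints; this one ignores them.
        node_id = self.node_ids[self.next_index % len(self.node_ids)]
        self.next_index += 1
        return node_id


def run_application_master(rm, work_items, launch):
    """Request one container per work item and track the results itself."""
    results = {}
    for item in work_items:
        node_id = rm.allocate({"memory_mb": 512})
        results[item] = launch(node_id, item)   # the AM talks to the node directly
    return results


rm = FakeResourceManager(["node1", "node2"])
results = run_application_master(
    rm, ["a.txt", "b.txt", "c.txt"], lambda node, item: f"{item} done on {node}"
)
```

Note that the results dictionary lives in the AM, not in the resource-manager, matching the point that subtasks have no status entry of their own.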

In Hadoop MapReduce programs, a program (eg a complex “query” over large amounts of data) produces a single Application Master application which then allocates containers on or near the HDFS datafile blocks that need to be read, and starts code in those containers which then processes data from those datafile blocks, ie the code has been brought to the data for efficient IO. A MapReduce program that needs to process an HDFS file that is 100 blocks large does not necessarily allocate 100 Yarn “containers” before starting; the appropriate level of parallelism is determined more dynamically.
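The “on or near” preference can be expressed as a three-step fallback. The sketch below assumes a hypothetical pick_node helper and a rack_of mapping from node name to rack name; it illustrates the idea of locality and is not the algorithm MapReduce or Yarn actually uses.

```python
def pick_node(block_hosts, free_nodes, rack_of):
    """Prefer a node holding the block, then one in the same rack, then any."""
    for node in free_nodes:
        if node in block_hosts:
            return node, "node-local"
    block_racks = {rack_of[host] for host in block_hosts}
    for node in free_nodes:
        if rack_of[node] in block_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
choice = pick_node({"n1"}, ["n3", "n2"], rack_of)
```

Here the block lives on n1, which is busy, so the helper settles for n2 in the same rack rather than n3 in another rack.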

An Application Master does not have to allocate containers for subtasks; it is valid for an AM to perform logic itself. It is also valid for an AM to run for long periods of time (or until explicitly shut down).

Auxiliary Services

Yarn provides the ability to specify zero or more “auxiliary services” in a Yarn nodemanager configuration file; these are java jars that are loaded and invoked when the nodemanager is started.

The primary use for this is to start an embedded MapReduce daemon which is responsible for providing Reduce tasks with access to the output of Map tasks. A Hadoop MapReduce map task will produce a bunch of files on the local filesystem (one for each reduce partition) then exit. The reduce nodes later connect back to the same yarn node to fetch those files via http, and something needs to be running locally to serve them up.
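Why a map task ends up with one file per reduce partition is easiest to see in code. This sketch uses a CRC32 hash as a stand-in partitioning function (an assumption made for the example, not what Hadoop uses) and keeps the “files” as in-memory lists.

```python
import zlib


def partition_for(key, num_reducers):
    """Map a key to a reduce partition using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_output_files(records, num_reducers):
    """Group (key, value) records into one 'file' (list) per reduce partition."""
    files = [[] for _ in range(num_reducers)]
    for key, value in records:
        files[partition_for(key, num_reducers)].append((key, value))
    return files


files = map_output_files([("apple", 1), ("pear", 1), ("apple", 1)], num_reducers=3)
```

All records with the same key land in the same partition, so each reduce task only needs to fetch its own file from every node that ran a map task.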

Auxiliary services run as a library/plugin within the same process as the nodemanager. A nodemanager is thus still just one unix process.

Use Cases

The ability to run anywhere in the cluster, to rely on the resource-manager to restart an app if it (or its host node) dies, and the ability to make the current IP/Port information available via a search of the yarn resource-manager using an application-name makes a yarn cluster a nice place in which to execute long-running services/daemons. Compared to the alternate approach of configuring the service to run on a specific machine, many administration tasks become easier. Service dies or host machine dies? Yarn will auto-restart. Need to scale up to multiple copies of the same service? No big deal, just submit some more copies of the same context to the yarn resource-manager.

The ability to run many short-term programs is also useful. To process a hundred files in parallel, submit the same context 100 times with a different filename for each. Or in the case of HDFS, process each chunk of a large file in parallel by submitting the same context once for each file-chunk.
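The fan-out pattern is little more than a loop that varies one argument. In this sketch the helper name, the dict layout and the file names are all hypothetical; actually submitting each entry to the resource-manager is left out.

```python
def contexts_for_files(base_name, script, filenames):
    """Create one submission per input file; only the command argument differs."""
    return [
        {
            "name": f"{base_name}-{index}",
            "resources": [script],
            "commands": [f"/usr/bin/python3 {script} {filename}"],
        }
        for index, filename in enumerate(filenames)
    ]


jobs = contexts_for_files(
    "convert", "myscript.py", [f"input-{i}.csv" for i in range(100)]
)
```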

Summary

The fundamentals of Yarn are really simple: it transfers a set of resource-files to some suitable host and then executes a list of commands there.

There is nothing language-specific about clients that submit requests; they just need to be able to send an appropriate Protocol Buffers message to the yarn resource-manager. There is also nothing language-specific about the executed commands - the OS on which the yarn nodemanagers execute just needs to be able to “bootstrap” the command (eg support python when the command is a python script, or java when the command runs a java jarfile). However in practice, because yarn is implemented in Java, life is easier when clients and commands are implemented in Java (or at least JVM languages).