Apache Hadoop Overview

Categories: BigData

Introduction

This is a quick overview of the components of the Apache Hadoop “big data” project (version 2.7), with links to articles discussing the individual parts in more detail.

When you go to the Apache Hadoop website, download the latest distribution, and unpack it, you will find:

  • Yarn daemons, client libraries, command-line tools, and some admin scripts
  • HDFS daemons, client libraries, command-line tools, and some admin scripts
  • The MapReduce job-history daemon and client libraries.

All of these are standard user-space code, intended to be run on top of standard operating systems - primarily Linux, although Windows is also supported (if you’re a masochist), and they probably work on BSD etc. too. All the Hadoop components are implemented in Java, although other languages are at least partially supported as clients; see the articles on individual components for more details.

Yarn is a system for distributing processes through a cluster, i.e. running an application on some arbitrary node in a cluster of servers. A client application builds an application context consisting of a set of arbitrary files (“resources”), a set of constraints on the kind of node that should be chosen, and a list of commands to execute. The application context is submitted to the primary Yarn server (the “resource manager”), which then chooses a suitable node from the available cluster, unpacks the resources there, and executes the specified commands. Typically the resources include the files that the commands “run”; they might be bash scripts, Python scripts, Java jars, or potentially native code.
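
To make this a little more concrete, here is a minimal sketch of submitting such an application context with the Java client API (YarnClient, from the hadoop-yarn-client library). The memory size and command are placeholder assumptions, and a real application would normally launch an “application master” here rather than a bare shell command:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            // Connect to the resource manager named in the yarn-site.xml on the classpath.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the resource manager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("hello-yarn");

            // The "commands to execute" part of the application context; resource
            // files would be attached via container.setLocalResources(...).
            ContainerLaunchContext container = Records.newRecord(ContainerLaunchContext.class);
            container.setCommands(Collections.singletonList("echo hello from a Yarn container"));

            // Constraints: how much memory and CPU the chosen node must provide.
            appContext.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore
            appContext.setAMContainerSpec(container);

            // Hand the whole context to the resource manager, which picks the node.
            yarnClient.submitApplication(appContext);
            yarnClient.stop();
        }
    }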

HDFS is a system for distributing files through a cluster. A client application asks the HDFS primary server (“namenode”) to store a file; HDFS splits the file into multiple chunks, and saves the chunks on storage-devices attached to different nodes within the cluster. Each chunk is stored multiple times to make the filesystem resistant to failures of storage-devices, whole servers, or network switches. When a client application wants to retrieve the file, it is assembled from the “most convenient” copy of each chunk. Spreading a file across the cluster in this way supports storage of very large files, easy expansion of the overall storage capacity, and robustness. It also increases total IO throughput when reading a file, as different chunks are on different servers (and thus different physical storage devices).
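
As an illustration, the sketch below uses the standard client API (org.apache.hadoop.fs.FileSystem) to write a file and read it back; the chunking and replication described above happen behind these calls. The namenode address and the path are placeholder assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // The client talks to the namenode, which tracks where each chunk ("block") lives.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
            Path path = new Path("/tmp/example.txt");

            // Write: the data is split into blocks, each replicated across several datanodes.
            try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the namenode supplies block locations and the client reads each block
            // from the most convenient replica.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }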

Yarn and HDFS are completely independent. It is quite reasonable to use Yarn without HDFS, and quite reasonable to use HDFS without Yarn. However, they can be combined very effectively to both store data and process it. Commonly, the same physical servers run both Yarn and HDFS daemons, i.e. act simultaneously as data-storage and data-processing nodes.
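
From a client’s point of view the combination is just a matter of configuration: the same Configuration object can point at both services and then be handed to an HDFS FileSystem client and a YarnClient alike. A minimal sketch with placeholder hostnames (these properties are normally set in core-site.xml and yarn-site.xml rather than in code):

    import org.apache.hadoop.conf.Configuration;

    public class ClusterClientConfig {
        public static Configuration clusterConfig() {
            Configuration conf = new Configuration();
            // Where HDFS stores the data...
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
            // ...and where Yarn's resource manager schedules the processing.
            conf.set("yarn.resourcemanager.hostname", "resourcemanager-host");
            return conf;
        }
    }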

Hadoop MapReduce is somewhat different from Yarn and HDFS; it is probably best described as an “orchestration library” that arranges for user code to be executed efficiently via Yarn against data stored in HDFS. The MapReduce part of the Hadoop distribution does provide a “job history” daemon service, but that isn’t very important - the primary value of MapReduce lies in the helper libraries it provides for using Yarn and HDFS.

MapReduce is now considered somewhat obsolete; processing frameworks such as Spark and query languages such as Spark SQL, HiveQL, or Pig Latin also help users to efficiently apply their code to data stored in HDFS via Yarn, and the APIs they provide are generally considered superior to the original MapReduce approach. Nevertheless, it is useful to understand MapReduce: it is a conceptually simpler model than the other tools and so gives a good starting point for understanding big data processing.
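
The canonical example of the MapReduce model is word counting: a mapper emits a (word, 1) pair for every word it sees, and a reducer sums the counts for each distinct word. The sketch below follows the standard org.apache.hadoop.mapreduce API; the input and output HDFS directories are taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for each input line, emit (word, 1) for every word in it.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each distinct word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // runs via Yarn
        }
    }

The library packages this class into a job, ships it to the cluster via Yarn, runs roughly one map task per HDFS block of input, and feeds the grouped map output to the reduce tasks.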

A Yarn cluster can potentially be set up in an “IaaS cloud” such as Amazon EC2, although it is more normal to set one up directly on “bare metal” (or possibly on bare-metal servers running a hypervisor such as Xen for low-level management). Whether virtualizing an HDFS cluster makes sense is debatable; the default approach is to co-locate the HDFS datanode daemons with the physical disk drives they manage, in which case virtualization makes little sense. Persisting HDFS data in a central block-based storage system has performance disadvantages, but does allow the datanode daemons to be deployed in a generic cloud. See the HDFS article below for more information.

More details can be found in the following articles:

Configuration Tools

Apache Ambari is a server application with a web interface which acts as a “manager” for a cluster of servers, rolling out and configuring additional Hadoop components. Sadly, configuration defined via Ambari is not version-controllable. The Cloudera Hadoop distribution provides “Cloudera Manager”, a proprietary tool which fills a similar role.

Bigtop (bigtop.apache.org) provides non-Ambari-based setup tools for Hadoop and related projects. DEB and RPM packages are provided for Hadoop-related components (including HBase). Puppet scripts are available to set up the various components on pre-existing server nodes, using the above packages; see https://github.com/apache/bigtop/tree/master/bigtop-deploy/puppet. Bigtop also provides Vagrant scripts to execute those Puppet scripts on dynamically allocated VMs or (still in progress) in dynamically allocated Docker containers.

References

  • “Hadoop: The Definitive Guide 4th Ed.”, 2015, Tom White, O’Reilly – 4th edition or newer is definitely recommended

    This book covers HDFS, YARN, MapReduce, Avro/Parquet, Flume, Sqoop, Pig, Hive, Spark, and HBase.

  • “Hadoop Application Architectures”, Grover et al., O’Reilly.

    This book describes how to apply HDFS, HBase, Hive, MapReduce, Spark, Pig, Giraph, GraphX, Oozie, Storm, Flume, etc. Knowledge of the content of “Hadoop: The Definitive Guide” is assumed.