Zookeeper Overview

Categories: Programming, BigData

Overview

The official Zookeeper documentation is pretty good, but a few items are missing - in particular, a “big picture” overview. This brief summary covers the information I had to piece together myself during a recent project.

Apache Zookeeper is an application which runs as a server, allowing client applications to read and update a simple persistent tree of values. Somewhat like a remote filesystem, the key for retrieving, inserting, updating or deleting a value is a path, and the values themselves are arbitrary blocks of bytes. Unlike simple key-value stores, the use of a structured “path” as a key also allows clients to “watch” paths, ie request notifications when data at or below that path is modified. Zookeeper is intended to act as a coordinator for multiple processes, not a distributed database: the size of each value in the tree is expected to be small (a few tens of kilobytes at most) and the entire tree is kept in memory.
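To make this concrete, here is a minimal sketch using the standard Java client bundled with zookeeper; the connection string and the /demo paths are invented for illustration:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkBasics {
    public static void main(String[] args) throws Exception {
        // The constructor returns immediately; wait until the session is
        // actually established before issuing requests.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Values are stored under filesystem-like paths; parents must be
        // created before children.
        zk.create("/demo", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/demo/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the value back, registering a one-shot watch that fires the
        // next time the data at this path is modified or deleted.
        byte[] data = zk.getData("/demo/config",
                event -> System.out.println("changed: " + event), null);
        System.out.println(new String(data));

        zk.close();
    }
}
```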

Clustering

Multiple Zookeeper instances can optionally be clustered together to provide resilience against failure of a node or of network connectivity to a node, and to improve performance when reading data (at the cost of slower updates). The members of a cluster dynamically choose one node to be the leader of the cluster, and all updates performed by a client application against any of the zookeeper instances are forwarded immediately to the current leader, which in turn notifies all other nodes of the cluster. On failure of the leader node, a new one is “elected”.

Because data updates always flow from the node at which the change was initiated up to the leader and then down to the other nodes, update traffic effectively forms a star topology. However the process of leader election, and the fact that any node can potentially become the leader, requires that all nodes be able to directly access all other nodes.

To avoid problems with cluster “partitioning”, each node in the cluster is aware of the maximum set of nodes in the cluster - ie is configured with information about even those nodes that are not currently running. A cluster is only valid if more than half of the potential nodes are accessible (a quorum): a five-node cluster, for example, remains usable as long as any three nodes can reach each other. When a node starts, it refuses to allow client applications to connect to it until it has verified connectivity to a majority of the potential nodes, ie until a quorum has been reached. This ensures that no client application can ever see out-of-date information by connecting to a zookeeper node which has somehow become disconnected from the cluster leader (ie disconnected from the majority of cluster nodes).

Installing and configuring a cluster is a manual and fairly static process: after unpacking the zookeeper software on a node, it must be configured with a unique node id (an integer in the range 1..255) and with a list of the network addresses (host:port) of all other nodes. When adding a new node to an existing cluster, all other nodes must of course have their list of node addresses updated to include the new node, and then be restarted. Note that zookeeper servers do not use any kind of auto-discovery to find other nodes; cluster membership is entirely explicit configuration, as in the sketch below.
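As an illustration, a minimal configuration file (zoo.cfg) for a hypothetical three-node cluster might look like the following; the hostnames and paths are invented, and by convention 2888 is the port peers use to talk to the leader while 3888 is used for leader election:

```
# zoo.cfg - identical on every node in the cluster
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
# server.<id>=<host>:<peer-port>:<leader-election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

In addition, each node stores its own unique id in a file named myid inside its dataDir; that file is the only per-node difference in the configuration.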

Version 3.5 (currently in alpha state) adds a major feature: dynamic cluster reconfiguration, in which the information about the set of nodes is itself distributed across the cluster.

Clients

ZooKeeper happens to be implemented in Java, but that is not particularly important, as it is intended to run as a standalone process. The same jarfile in which the server is implemented also contains a class that Java client applications can use to connect to a server node. There is a separate client library for C, and various other languages are supported via bindings to the C library.

The Apache Curator project provides additional Java client libraries for Zookeeper which are highly recommended. The Curator framework library provides an alternative way for a client application to connect to a zookeeper server, including automatic retry of read or update operations if the current server node fails during the operation. As with zookeeper’s inbuilt Java client, there is no auto-discovery of server nodes: the client needs to be given a list of available nodes that it may connect to.
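A minimal sketch of connecting via the Curator framework; the server list and retry settings are illustrative, not recommendations:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorConnect {
    public static void main(String[] args) throws Exception {
        // The client is given an explicit list of candidate servers; if the
        // node it is talking to fails, it fails over to another and retries
        // the operation according to the retry policy.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3)); // 1s base sleep, 3 retries

        client.start();
        byte[] data = client.getData().forPath("/demo/config");
        System.out.println(new String(data));
        client.close();
    }
}
```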

Zookeeper is sometimes used as a service-discovery mechanism for other communication protocols (eg remote OSGi services): applications which publish services connect (statically) to a zookeeper server and create entries in the tree describing their address and services, while applications which wish to use services connect (statically) to a zookeeper server and search the tree for information about the services they need. This does decouple service publishers and consumers effectively, but the client-to-zookeeper and zookeeper-to-zookeeper connections themselves are always explicitly configured, never discovered.
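The lookup side of that pattern is then just a read of the relevant part of the tree; a sketch using the plain Java client, with an invented /services registration path:

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ServiceLookup {
    // 'zk' is assumed to be an already-connected ZooKeeper handle.
    static void listWorkers(ZooKeeper zk) throws Exception {
        // List the znodes under the registration path, setting a watch so
        // that we are notified when services appear or disappear.
        List<String> workers = zk.getChildren("/services",
                event -> System.out.println("membership changed: " + event));
        for (String name : workers) {
            byte[] address = zk.getData("/services/" + name, false, null);
            System.out.println(name + " -> " + new String(address));
        }
    }
}
```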

Additional Functionality

While the basic tree-store and clustering functionality of Zookeeper is itself useful for many applications, it can also serve as a base on which to implement more sophisticated cluster synchronization operations such as distributed locks, semaphores and counters. The Zookeeper website documents many of these basic algorithms, and the Curator project provides a Recipes library that implements them for Java clients; the distributed lock sketched below is a typical example.
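As a sketch, this is roughly what Curator’s InterProcessMutex recipe looks like in use; the lock path is invented, and client is assumed to be a started CuratorFramework as shown earlier:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

public class LockExample {
    static void doExclusiveWork(CuratorFramework client) throws Exception {
        // Only one process in the entire cluster can hold the lock for
        // this path at any one time; the rest block in acquire().
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        lock.acquire();
        try {
            // ... work that must not run concurrently anywhere else ...
        } finally {
            lock.release();
        }
    }
}
```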

The Values Tree

Each zookeeper node keeps the entire tree of values in memory - thus the tree cannot be overly large. Zookeeper is nevertheless very careful to write all changes to the filesystem: a change to the tree is not considered complete until it has been flushed to disk.

However, when a client application creates a node in the zookeeper tree (called a “znode”), it can explicitly mark it as “ephemeral”, in which case the znode is removed as soon as the client which created it disconnects from the cluster. This is particularly useful for clients which “register services” of some kind as nodes in the tree, as sketched below.
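A sketch of such a registration, reusing the plain Java client from earlier; the path and payload are invented, and the /services parent znode is assumed to already exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceRegistration {
    // 'zk' is assumed to be an already-connected ZooKeeper handle.
    static void register(ZooKeeper zk) throws Exception {
        // EPHEMERAL_SEQUENTIAL appends a unique counter to the name, so
        // many workers can register under the same parent without clashes.
        String path = zk.create("/services/worker-",
                "host-and-port-of-this-worker".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + path);
        // No cleanup needed: the znode is deleted automatically when this
        // client's session ends, however the process dies.
    }
}
```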

Each znode has a version number associated with it, which is incremented on each update. Client applications normally ignore this and simply add a “watch” on a node to get a callback when the node changes, but direct access to the version number can sometimes be useful - for example, to make an update conditional on the znode not having changed since it was read.
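For example, passing the previously-read version number to setData turns the update into a compare-and-set; a sketch reusing the hypothetical /demo/config path from earlier:

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdate {
    // 'zk' is assumed to be an already-connected ZooKeeper handle.
    static void compareAndSet(ZooKeeper zk, byte[] updated) throws Exception {
        // Read the current value; the Stat is filled in with metadata
        // about the znode, including its current version number.
        Stat stat = new Stat();
        byte[] current = zk.getData("/demo/config", false, stat);

        // The write is rejected (KeeperException.BadVersionException) if
        // any other client has updated the znode since we read it.
        zk.setData("/demo/config", updated, stat.getVersion());
    }
}
```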

Embedding a Zookeeper Server

The authors of the Zookeeper implementation intend it to be used as a standalone process; although it is implemented in Java and distributed as a plain Java library (jar), it isn’t really intended to be run in the same JVM process as other Java code. Nevertheless, when building a clusterable application in Java it can be useful to “embed” zookeeper in this way, and with some effort it is possible; the FuseESB product does this, and a colleague of mine has also achieved it. There are some quirks however, and you won’t get much help from the documentation or mailing lists.

Even when a zookeeper instance has been successfully embedded, code in the same process must still connect to it via a network socket: unlike libraries such as ActiveMQ (with its vm:// transport), zookeeper has no protocol for in-process communication, as the implementers did not intend it to be used that way.
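For completeness, a rough sketch of such an embedding; it assumes that ZooKeeperServerMain.runFromConfig can be invoked directly in the version at hand, the paths and ports are invented, and since the call blocks for the lifetime of the server it is run on its own thread:

```java
import java.util.Properties;
import org.apache.zookeeper.server.ServerConfig;
import org.apache.zookeeper.server.ZooKeeperServerMain;
import org.apache.zookeeper.server.quorum.QuorumPeerConfig;

public class EmbeddedZk {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("dataDir", "/tmp/embedded-zk");
        props.setProperty("clientPort", "2181");

        QuorumPeerConfig quorum = new QuorumPeerConfig();
        quorum.parseProperties(props);
        ServerConfig config = new ServerConfig();
        config.readFrom(quorum);

        // runFromConfig blocks until the server shuts down, so start it on
        // a background thread. Code in this JVM must still connect via
        // localhost:2181 - there is no in-process shortcut.
        new Thread(() -> {
            try {
                new ZooKeeperServerMain().runFromConfig(config);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, "embedded-zookeeper").start();
    }
}
```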
