Apache Cassandra Overview

Categories: BigData

Introduction

Cassandra is a BigTable-like database, ie it stores tabular data (very roughly relational-like). Unlike relational databases, it is “schema free”. This article provides an introduction to Cassandra, and compares it to the very similar HBase and the somewhat similar Hive. It is assumed that the reader is already familiar with HBase (or has at least read the linked article).

The primary differences between HBase and Cassandra are:

  • HBase has a single master server for each “region” of data, thus does not provide high availability but does automatically provide consistency. Cassandra is a multi-master system, thus failure of a single node does not cause an outage - but complicated mechanisms are needed to ensure data consistency when concurrent writes are applied to the same data via different servers.
  • Cassandra installation and administration is significantly different (mostly simpler).
  • Cassandra defines and supports its own query language (CQL). HBase provides only an API; query-languages are provided by external projects.

What is Cassandra

Cassandra is an extremely scalable NoSQL database. Cassandra’s data-model roughly resembles a relational database in that data is modelled as tables with rows. However it doesn’t fully implement the relational model - no joins, subqueries, or transactions. In place of joins for modelling 1:N relations, it supports “collections”. It provides its own SQL-like query language “CQL”.
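
As a minimal sketch (the table and column names here are invented for illustration, not taken from the article), a CQL table using a collection in place of a joined child table might look like this:

    -- Assumes a keyspace has already been created and selected (see below).
    CREATE TABLE users (
        user_id  uuid PRIMARY KEY,
        name     text,
        emails   set<text>   -- a 1:N relation modelled as a collection, not a join
    );

    INSERT INTO users (user_id, name, emails)
    VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 'Alice',
            {'alice@example.org', 'alice@work.example'});

    -- Adding to the collection later:
    UPDATE users SET emails = emails + {'alice2@example.org'}
    WHERE user_id = 5132b130-ae79-11e4-ab27-0800200c9a66;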

Like HBase, Cassandra deals well with OLTP-style loads: good performance for random reads of individual records, reasonable performance for updates. It also deals reasonably well with OLAP-style loads: reads of very large numbers of records. Spark-based tools may be used to process data in-place. It was previously possible to use Hadoop MapReduce with Cassandra but this community-maintained code has “bitrotted” and no longer works against current Cassandra releases.

Cassandra clusters very well - all members of the cluster are peers (no master/slave roles), and it scales to very large amounts of data. The rows of a table are partitioned (sharded) across all members of the cluster using the value of the row’s primary key. As is usual with a “multi-master” cluster, there needs to be a mechanism to resolve consistency between nodes. Cassandra supports various options (“tunable consistency”). Cassandra does not use YARN or other cluster-management tools; Cassandra nodes are “long-running” processes (like HBase and unlike Hive).

At the lowest storage levels, Cassandra uses the SSTable algorithm to store records (as HBase also does).

Cassandra can be accessed from Java via JDBC, from Python via DBAPI2, etc. - the Cassandra “row/column” model is close enough to relational for these APIs to still be valid.

In earlier versions of Cassandra, the name “column family” was used; newer versions (and newer documentation) use the term “table” instead. A Cassandra “keyspace” is roughly equivalent to a relational “schema” (ie a namespace holding multiple tables).
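
As an illustration (the keyspace name and settings are assumptions, not something from the article), creating a keyspace looks roughly like this; note that the replication factor discussed later is configured here, at the keyspace level:

    CREATE KEYSPACE example_ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    USE example_ks;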

Warning: at the current time, the Wikipedia article on Cassandra describes it as “column oriented”, and links to the Wikipedia page on “column oriented” datamodels. AFAIK, the Wikipedia articles are very misleading - see this article on “column oriented” databases.


Sharding vs Regions

HBase divides the contents of each table into a set of regions (key-ranges); each region is hosted on a different server. HBase maintains a table of (min-key, max-key, server) so that for any specific record, the responsible server can quickly be found. When a range grows too large (contains too many records), it can be split into two parts and one of the parts migrated to another node. Each node can manage multiple regions (multiple min/max ranges).

Cassandra functions somewhat similarly, the major difference being that it

  1. explicitly separates the record key into two parts - a partitioning key and a clustering key, and
  2. chooses a “shard id” (aka “token” in Cassandra documentation) for the record via hash(partitioningkey) modulo numshards.

The number of shards (partitions) for a table is chosen when the table is defined. A “token” thus represents the set of records where hash(partitioningkey) modulo numshards maps to the same value. Cassandra maintains a table of (token, server) so that for any specific record the responsible server can quickly be found. A single server can manage multiple tokens (shards). As a cluster is expanded with new physical nodes, existing shards/partitions (the set of records associated with a specific token) can be migrated from their current host to a new host. However (unlike HBase) the contents of a shard cannot be split into new shards; the number of shards in a table is fixed. A table can therefore be distributed over anywhere from one host (all shards on the same node) up to numshards different hosts (one shard per host).

As Cassandra supports a replication factor for each keyspace (and therefore for each table within it), there are actually N servers for each token (shard), each with a full copy of the data. Cassandra therefore maintains a (token, list-of-servers) table, rather than the simpler (range, server) mapping in HBase. This increases both the read-throughput (reads can be performed against any of those nodes) and the write-throughput (writes can also be performed against any of them - though the consequence is the need for a complex conflict-resolution system).

The Cassandra approach gives slightly more control to the database architect; the split between “partitioning key” and “clustering key” is explicit, and “scanning” over all records with the same partitioning-key is efficient (within a partition, rows are stored sorted by the clustering key). In HBase, the same distinction exists but is implicit: the “partitioning key” is the record-id prefix and the “clustering key” is the record-id suffix (because records are split by ranges).
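
As a rough CQL sketch of that explicit split (the table and column names are invented for illustration), the partitioning key and the clustering column are declared together in the PRIMARY KEY clause:

    -- sensor_id is the partitioning key (hashed to choose the token/shard);
    -- reading_ts is the clustering column (rows within a partition are stored
    -- sorted by it, which is what makes scans within a partition efficient).
    CREATE TABLE sensor_readings (
        sensor_id   text,
        reading_ts  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id), reading_ts)
    );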

The Cassandra approach has higher read and write performance (multiple hosts per shard due to replication). The multi-master approach also makes the data “highly available” (an outage of a small number of nodes in the cluster does not affect clients). However a poorly-chosen partitioning key can lead to “hot spots” where a single server is overloaded while others sit idle; HBase has the same issue, though Cassandra’s multi-master replicas at least spread this load somewhat. The Cassandra implementation is also more complex than HBase, and is difficult to scale past the point where nservers = nshards.
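
One common mitigation (sketched here using the invented sensor_readings example above) is to widen the partitioning key so that a single busy key is spread over many tokens, and therefore over many servers:

    -- A composite partition key: readings for one sensor are now spread over
    -- one partition per day rather than a single ever-growing hot partition.
    CREATE TABLE sensor_readings_by_day (
        sensor_id   text,
        day         date,
        reading_ts  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, day), reading_ts)
    );

The trade-off is that a read spanning several days must now issue one query per (sensor, day) partition.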

Consequences of Multiple Masters for Data Storage

HBase does not replicate the data that it stores - at least not at the database level, although data is stored replicated in HDFS. The HBase approach is simple to understand, and simple to implement - all reads and writes for a range of record-keys go to a server “dedicated” to handling that range (region). There are never any races or other unusual interactions between reads and writes, because they all pass through the same server instance. There are however two obvious drawbacks:

  • A single server for a range of record-ids can lead to “hot spots” (limiting scalability particularly on read), and
  • On failure of that node, failover to a “cold standby server” can take tens of seconds.

Cassandra instead allows a “replication factor” to be defined for a keyspace (and thus its tables), and there are then that number of active servers serving the same data. There is of course a price: things get more complicated. As with HBase, the individual servers use SSTables for storage, and updates are first written to an edit-log then to an in-memory buffer, with the memory-buffer being flushed to disk as an SSTable only periodically (to avoid generating many small SSTable files). The existence of these in-memory buffers plus multiple processes serving the same region means that the processes need to explicitly replicate data between themselves in near-real-time. This leads to the first user-visible difference between HBase and Cassandra: Cassandra reads and writes need to specify a consistency level, eg ONE, QUORUM, or ALL. On write, this indicates how many of the “replicated instances” need to confirm the write has been accepted before the write is considered to have “succeeded” (and the writing process may continue).

Writes with a consistency level of ONE are of course fastest, but mean that other nodes might return stale data for a while (the remaining replication is asynchronous). Writes with a consistency level of ALL guarantee that no other node will return stale data after the write has completed, but will fail (ie are not HA) if any of the replicas is on a node in the cluster which is down. QUORUM means that the write blocks until over 50% of the replicated copies of that row have received the data.

Reads specify the same levels; a read with level ONE is fast - only one of the servers holding a replica of the row needs to be queried. However it can return “stale data” if a write was recently made via a different instance. A read with level ALL guarantees consistent data is read, even when the writer used level ONE, but is obviously slow and not HA-compliant.

The QUORUM read level is used together with the QUORUM write level, ie writer and reader need to agree to use QUORUM otherwise it is no better than ONE. When both writes and reads use QUORUM, then both are reasonably performant and HA-compliant.
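
For illustration, in cqlsh the consistency level is a session-wide setting (client drivers instead set it per statement); this sketch reuses the invented sensor_readings table from above:

    -- CONSISTENCY is a cqlsh command, not part of CQL itself.
    CONSISTENCY QUORUM;

    INSERT INTO sensor_readings (sensor_id, reading_ts, value)
    VALUES ('s-42', '2017-01-01 12:00:00', 21.5);

    SELECT value FROM sensor_readings
    WHERE sensor_id = 's-42' AND reading_ts = '2017-01-01 12:00:00';

With both statements at QUORUM, the set of replicas that acknowledged the write and the set queried by the read must overlap in at least one node, which is what makes this combination consistent.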

When using write-levels other than ALL, there is a common problem: a process which writes a record then reads the same data back might see “stale” data due to reading from a server other than the one it wrote to. The Cassandra client libs generally work around this problem by tracking writes, and ensuring that reads from the same table within a specific time-period are directed to the same node on which the write was performed. This reduces performance very slightly, but ensures a “stale read of just written data” never happens. This of course does not prevent other processes from seeing the stale data.

As with HBase, cells are versioned with timestamps which makes replication of data much simpler - there are no problems with ordering of conflicting writes performed via different nodes in the cluster (unless through bad luck exactly the same timestamp happened to be used, in which case it isn’t clear what happens).
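
The per-cell timestamps are visible from CQL; a small illustration (again against the invented sensor_readings table) of reading a cell's write-timestamp and of supplying an explicit timestamp on an update:

    SELECT value, WRITETIME(value)
    FROM sensor_readings
    WHERE sensor_id = 's-42' AND reading_ts = '2017-01-01 12:00:00';

    -- Timestamps are microseconds since the epoch; the cell with the highest
    -- timestamp wins when replicas hold conflicting versions.
    UPDATE sensor_readings USING TIMESTAMP 1483272000000000
    SET value = 22.0
    WHERE sensor_id = 's-42' AND reading_ts = '2017-01-01 12:00:00';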

Cassandra’s decision to serve the same data from multiple servers caused a problem (multiple in-memory buffers) which had to be solved with real-time replication. While complicated, this replication mechanism has some other advantages - in particular, Cassandra no longer needs to store its SSTable and edit-log files in a distributed filesystem. HBase relies on HDFS (or other DFS) to replicate data for reliability, but Cassandra’s direct node-to-node replication provides the same resistance against disk-failure and node-failure without needing a DFS at all, simplifying life for sysadmins. This does mean that MapReduce/Spark/etc cannot process Cassandra data by running directly over the underlying database storage files, but even with HBase that is not a normal way to bulk-process data.

A second consequence of allowing multiple servers to hold the same rows is that regions can be more fine-grained. If a Cassandra table defines the “partitioning key” to be the whole primary-key for the record then records within the table are distributed evenly across all shards of the table, improving both write and read throughput. When the number of shards is greater than or equal to the number of nodes in the cluster, then the table is distributed evenly across the whole cluster. This is not advisable with HBase, as an outage of any single node in the cluster would then impact 100% of all tables.

Cross-data-center Replication

For disaster-recovery and performance it is often important to replicate data not just within a cluster, but between clusters running in different datacenters. The significantly higher latency over network links between data-centers makes this a different problem.

HBase has a solution for this, involving transferring the “edit logs” (aka transaction logs) between clusters. Setup is non-trivial but also not particularly complex.

Cassandra simply reuses its standard intra-cluster replication mechanisms to transfer data between clusters - very little setup is needed.

Configuration for High Availability

As noted above, Cassandra’s ability to serve the same data from multiple nodes concurrently (with associated near-real-time replication) makes it highly available. The fact that a Cassandra cluster consists solely of a set of identical Cassandra processes, with no external dependencies, means that there are no single-points-of-failure and that setup/administration is very easy.

HBase is more complex, having several components:

  • HDFS namenode
  • HDFS datanodes
  • HBase master
  • HBase region-server
  • ZooKeeper

All of the above can be set up in high-availability mode, ie a properly-configured HBase system has no single points of failure. However getting that configuration correct is far more complex than simply installing N Cassandra nodes. There are tools to help, eg Apache Ambari, but a simple setup is still preferable to a complex one plus helper tools.

And as noted above, on failure of an HBase region-server it can take tens of seconds for the regions it manages to be reallocated to other surviving region-server instances, as they need to replay the edit-logs from the failed server (which are in the DFS) before beginning to serve data.

It is possible that in future HBase will have faster failover - it seems feasible to have “warm standby” servers for each region which continually apply edit-logs from the “current master” for the region in order to be “nearly ready” to take over.

The multiple-servers-with-replication approach is inspired by the Amazon Dynamo key-value-store system. Cassandra’s datamodel is inspired by BigTable.

Performing Reads and Writes

In HBase, a client first determines which region-server manages the data to be read or written, and then communicates directly with that node.

In Cassandra, a client instead simply chooses a node at random. If the node does not itself have the relevant data, it forwards the request to a suitable node. This potentially doubles the amount of network traffic in comparison to HBase, and introduces latency due to the need to forward (copy) the data in both directions. However from the client side it does simplify things somewhat.

Defining and Querying Tables

The HBase native API for storing and retrieving data is very “direct” and low-level: SCAN and PUT. These operations correspond directly to the way data is stored. HBase itself does not include any higher-level query language; even the “hbase shell” accepts just SCAN and PUT commands. Those wanting an SQL-like interface to HBase need to use Hive, Phoenix, Drill, etc.

Despite having the same underlying storage mechanisms (SSTables, log-structured-merge-trees, etc), Cassandra does not expose SCAN and PUT operations. Instead it provides a very SQL-like interface, most similar to Phoenix-on-HBase. There are however some significant limitations on the “where” clauses of queries:

  • the partition-key value must always be explicitly provided (as an exact constant)
  • range-queries may be applied to the “clustering” part of the table key. When there are multiple clustering columns in the key, exact values must be provided for the columns preceding the one carrying the range condition.
  • column values are untyped at the storage level (byte sequences), even though CQL presents typed columns.

HBase exposes the concept of a “rowkey” directly to users. Cassandra instead uses the standard SQL conventions: a table has a “primary key”, consisting of an ordered set of non-nullable columns. In a slight extension to SQL, the user must indicate which of the components of a “compound key” are to be used for partitioning the data. Queries against the table must then always provide exact values for these columns (except in the case of a full table scan).
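
A couple of examples against the invented sensor_readings table above show the effect of these rules:

    -- OK: exact partition key, range on the clustering column.
    SELECT * FROM sensor_readings
    WHERE sensor_id = 's-42'
      AND reading_ts >= '2017-01-01' AND reading_ts < '2017-02-01';

    -- Rejected: no partition key, so every node would have to be scanned.
    -- (Cassandra will only run it if ALLOW FILTERING is appended, which
    -- forces a full scan and is rarely a good idea.)
    -- SELECT * FROM sensor_readings WHERE reading_ts >= '2017-01-01';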

Materialized Views

BigTable-like databases only index on the “rowkey”. If data must be looked up by some other criteria, then a second table must be maintained, ie the data must be “denormalized”.

Cassandra provides support for “materialized views”. A CQL (Cassandra SQL variant) statement defines how to derive the materialized view from a base table. Cassandra then creates a real table containing the results of applying this statement to each existing row in the base table. In addition, each time the base table is updated the derived table is updated appropriately. Updates are performed asynchronously, ie there can be a time-lag between updating of the base table and the corresponding change in the derived table. The derived table cannot be updated directly. There is of course a performance impact for this feature, as each update to the base table triggers a second update.
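
A small sketch (Cassandra 3.0+ syntax; the table and view names are invented) of maintaining a lookup-by-email view automatically rather than denormalizing by hand:

    CREATE TABLE users_by_id (
        user_id  uuid PRIMARY KEY,
        email    text,
        name     text
    );

    -- The view's key must include every primary-key column of the base table
    -- plus at most one other column, and all of them must be filtered as
    -- NOT NULL.
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT user_id, email, name
        FROM users_by_id
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY ((email), user_id);

Queries such as SELECT * FROM users_by_email WHERE email = 'alice@example.org' then hit the view directly, while Cassandra keeps it up to date (asynchronously) as users_by_id changes.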

HBase does not itself provide the same functionality, but Phoenix-over-HBase provides something almost identical. HBase also supports generic “coprocessor hooks” which can be used to implement this with only a few lines of code.

DataStax Enterprise (DSE)

DSE is a commercial variant of Cassandra. One primary feature is the integration of Solr search into Cassandra: select-statements can include Solr search terms in the where-clause, and DataStax Enterprise will optimise the query appropriately.

DSE also provides better administration consoles for Cassandra.
