Introduction
The terms “column store database”, “column oriented database” or “column oriented storage” are sometimes encountered when reading about “big data” databases - ie those intended for large volumes of data. This article is intended to give a brief description of what they mean (or can mean, depending on context).
This information was originally part of the Big Data Storage article on this site. As it is apparently something many people are unclear about, I’ve extracted it into this separate page to make it easier to find. See the original article for more background about alternative types of data storage technology.
Definitions
What does it mean for a database to be a “column store” or be “column oriented”? Unfortunately, these expressions (and variants thereof) are used by various sites/documents/books to mean four quite different things:
- A traditional relational database, where tables have columns;
- A database where 1:N relations are modelled by dynamically adding columns to a row;
- A disk format in which each column is stored in its own file; or
- A disk format in which columns are grouped together in blocks within a file.
The first one (tables have columns) is hopefully obvious; we are all familiar with the relational model.
The other meanings are described below. By the way, I’ve defined my own names for these (the section titles), just to distinguish them for the purpose of this article.
Wide Row Schemas
There are a number of NoSQL databases which encourage developers to dynamically add columns to rows in a table to store associated data. In particular, they encourage modelling 1:N relations as N additional columns on a row/record.
The concept of storing data in “expandable” rows like this is sometimes called “column oriented”, regardless of the actual implementation.
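For illustration, here is a minimal sketch of the idea in plain Python (not any particular database’s API): each row is an open-ended map of column names to values, and a 1:N relation becomes N dynamically-named columns on one row. The column names are hypothetical.

    # One "wide" row: a 1:N relation (customer -> orders) modelled as
    # dynamically-added columns on a single row.
    row = {
        "name": "Alice",
        "city": "Berlin",
        # one extra column per related order
        "order:2023-01-05": "order-9001",
        "order:2023-02-17": "order-9002",
        "order:2023-03-02": "order-9003",
    }

    # Reading the 1:N relation back means scanning the column names:
    orders = [v for k, v in row.items() if k.startswith("order:")]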
File Per Column
When traditional relational databases write their data to disk, they generally write the content of all columns for the same row into a single block of data within a file. In other words, to read back the entire content of a row is a single read operation.
A pure “column store” database instead has a directory per table, and within that directory a file for each column (at least conceptually). To append a new record, a single value is appended to each file. To read the entire record back, a read operation is needed for each file.
This “column store” disk format has obvious disadvantages when inserting or reading individual records. When writing or reading large numbers of records at a time, the disadvantage is much smaller.
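To make the layout concrete, here is a minimal sketch in Python, assuming newline-delimited text values and a hypothetical “customers” table directory (a real column store would of course use a compact binary, compressed encoding). Note how reading one row back touches every file:

    import os

    def append_record(table_dir, record):
        # appending one record means appending one value to every column file
        for column, value in record.items():
            with open(os.path.join(table_dir, column), "a") as f:
                f.write(str(value) + "\n")

    def read_record(table_dir, row_number):
        # reading one whole row back touches every column file
        record = {}
        for column in os.listdir(table_dir):
            with open(os.path.join(table_dir, column)) as f:
                for i, line in enumerate(f):
                    if i == row_number:
                        record[column] = line.rstrip("\n")
                        break
        return record

    os.makedirs("customers", exist_ok=True)
    append_record("customers", {"name": "Alice", "age": 34, "city": "Berlin"})
    append_record("customers", {"name": "Bob", "age": 51, "city": "Paris"})
    print(read_record("customers", 1))  # Bob's row, assembled from three files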
Where the format really has benefits is when a query scans many rows but only looks at a few columns - eg “select min(age) from customers”, in which every record in the table is checked but only one column is needed. In this case, only one of the files needs to be read, saving large amounts of disk IO. Operations like “select sum(quantity) from orders where product_id = 123” would also be a good candidate - the row might have 50 columns, but the query only needs to access two of them.
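Continuing the sketch above, the “select min(age)” query opens just one of the column files, no matter how many columns the table has:

    import os

    def column_min(table_dir, column):
        # scans a single column file; all other columns stay untouched on disk
        with open(os.path.join(table_dir, column)) as f:
            return min(int(line) for line in f)

    print(column_min("customers", "age"))  # 34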
This approach also has benefits for data compression; compressing values saves a little disk space but, more importantly, it saves IO. In modern computers it is often faster to read compressed data into memory and then decompress it than to read the uncompressed form - CPUs and RAM are far, far faster than disks. The best compression algorithm often depends upon the data type; a sequence of integers, dates, or strings is best compressed with a type-specific algorithm. Grouping by column therefore allows far better compression than the row-store format, in which columns of different types are mixed together.
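As a small illustration of type-specific compression, a column of roughly-ascending integers (eg timestamps) delta-encodes into small numbers that pack into far fewer bytes than the raw 64-bit values - structure that a generic byte-oriented compressor cannot exploit nearly as well:

    timestamps = [1700000000, 1700000007, 1700000012, 1700000030, 1700000031]

    def delta_encode(values):
        # store the first value, then only the (small) differences
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def delta_decode(deltas):
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    deltas = delta_encode(timestamps)  # [1700000000, 7, 5, 18, 1]
    assert delta_decode(deltas) == timestamps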
Grouped Columns
It is possible to get many of the benefits of “column per file” layout while still having a single file per table, by grouping records together in “blocks” within the file.
Given a sequence of 5000 records, each with 4 columns named C1..C4, these can be grouped into blocks of 1000 and stored in the file with the following layout:
- values of C1 for the first 1000 records
- values of C2 for the first 1000 records
- values of C3 for the first 1000 records
- values of C4 for the first 1000 records
- values of C1 for the next 1000 records
- values of C2 for the next 1000 records
- … etc
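A minimal sketch of writing this layout (fixed-width 64-bit integers, purely to keep the example short):

    import struct

    BLOCK = 1000
    COLUMNS = ["C1", "C2", "C3", "C4"]

    def write_blocked(path, records):
        with open(path, "wb") as f:
            for start in range(0, len(records), BLOCK):
                block = records[start:start + BLOCK]
                for col in COLUMNS:  # one contiguous run per column, per block
                    for rec in block:
                        f.write(struct.pack("<q", rec[col]))

    records = [{"C1": i, "C2": 2 * i, "C3": 3 * i, "C4": 4 * i}
               for i in range(5000)]
    write_blocked("table.bin", records)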
When reading data, this has similar performance properties to the “one file per column” approach described above. A query that needs only the values of C1 can do a single seek, then one large read to get 1000 values of that column at a time, without reading any unwanted data. Unlike the pure approach, however, a seek is then needed to “jump over” the unwanted columns and reach the next block of C1 values.
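Continuing the writer sketch above, reading back only C1 is one seek per block followed by one large sequential read, skipping the runs of C2..C4 entirely:

    def read_column(path, col_index, n_records,
                    n_columns=4, block=1000, width=8):
        values = []
        with open(path, "rb") as f:
            for start in range(0, n_records, block):
                rows = min(block, n_records - start)
                # seek to this block's run of the wanted column ...
                f.seek((start * n_columns + col_index * rows) * width)
                # ... then read all of its values in one go
                data = f.read(rows * width)
                values += [struct.unpack_from("<q", data, i * width)[0]
                           for i in range(rows)]
        return values

    c1 = read_column("table.bin", 0, 5000)
    assert c1 == [r["C1"] for r in records]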
The data compression advantages of grouping values of the same type are also available to this format.
Obviously, appending a new row is not particularly easy in this format. It is therefore best used by systems which hold records in memory until a threshold is reached (eg 100k records), and then dump them out as a new file in this format. Many big-data batch-processing systems work in this way.
An example of this is the ORC (Optimized Row Columnar) file format commonly used with Apache Hive.
Implementations of this kind of layout can perform additional tricks to optimise disk space usage and performance. In the ORC format, a bitmap at the start of each column section indicates which entries are null; this reduces the disk space needed to store a null “cell” from its native size (eg 64 bits for a long integer) to 1 bit. Each column block also records the minimum and maximum of the values it contains, so that selects like “where C1 > 100” can skip over entire blocks of values. Bloom filters can also be used to optimise the question “is value X somewhere in this block of data?”.
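The min/max trick is easy to sketch: each block carries its statistics, and a predicate like “where C1 > 100” discards whole blocks without decoding them. (Hypothetical in-memory blocks here, not the actual ORC structures.)

    blocks = [
        {"min": 0,   "max": 99,  "values": list(range(0, 100))},
        {"min": 100, "max": 199, "values": list(range(100, 200))},
        {"min": 200, "max": 299, "values": list(range(200, 300))},
    ]

    def scan_greater_than(blocks, threshold):
        hits = []
        for b in blocks:
            if b["max"] <= threshold:
                continue  # the statistics prove no value here can match
            hits += [v for v in b["values"] if v > threshold]
        return hits

    print(len(scan_greater_than(blocks, 100)))  # 199; first block never read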
Bigtable-like Datastores
Some very clever software developers at Google came up with a radically new architecture, called Bigtable, which allowed Google to build its indexes of the internet. Google never released the implementation of their “bigtable” datastore, but did publish an academic paper. Several open-source datastores were implemented using the ideas described there, and thus have similar features. They are often referred to as “bigtable datastores” or “bigtable-like databases”; the best known are Apache HBase and Apache Cassandra.
Bigtable-like databases fit several of the categories above: they are a hybrid of row-store and column-store, and they support dynamic columns (“wide row schemas”).
Like many (but not all) NoSQL databases, tables in HBase or Cassandra don’t have strict schemas, and different rows can have different numbers of “columns”. A single row may have many millions of “columns”, allowing quite large 1:N relations to be modelled in this way. They are thus clearly “wide” with regard to schema.
HBase and early versions of Cassandra expose the “dynamic columns” approach explicitly. In later versions of Cassandra (3.x), the high-level query language instead represents collections of values as lists/sets rather than “dynamic columns”. Under the hood the implementation is still the same; N extra individual columns and a single column holding N values are logically equivalent - the difference is simply how the data is presented to the user (programmer or query writer).
Bigtable datastores support the concept of “column families”, which are groups of columns. Each column-family is stored in its own file, with the columns belonging to the same family stored together. A table with a single column-family holding all columns is therefore effectively “row-store”, while a table in which each column has its own column-family is in “column-store” format. The software developer can therefore analyse the access patterns (which columns are accessed together) to optimise the storage layout. This approach does lose the ability to do type-specific compression - but bigtable databases typically don’t enforce type constraints on columns (ie a column can contain any type) and so don’t have the necessary information for type-specific compression anyway.
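As a sketch of how column families steer the physical layout (hypothetical schema and helper, not the actual HBase/Cassandra API): columns placed in the same family land in the same file, so frequently co-accessed columns behave as “row store” while the rest behave as “column store”:

    # two column families: "profile" columns are read together,
    # "metrics" is scanned on its own
    schema = {
        "profile": ["name", "email", "city"],
        "metrics": ["login_count"],
    }

    def file_for_column(table, column):
        for family, columns in schema.items():
            if column in columns:
                return f"{table}/{family}.dat"
        raise KeyError(column)

    print(file_for_column("users", "name"))         # users/profile.dat
    print(file_for_column("users", "login_count"))  # users/metrics.dat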
As a side-note, Bigtable and early versions of Cassandra referred to tables as “column superfamilies”, ie a “table” is a set of column-families, each of which holds a set of columns. This is true, but unnecessarily confusing given that the term “table” was already in wide use; the term “superfamilies” is now only found in older documentation.
OLTP vs OLAP
Applications used for day-to-day business (eg customer-support apps in a call-center, or ecommerce sites) typically read (or insert) whole records at a time. They also touch only a few records per operation, but want responses within milliseconds (or at worst seconds). These are called OLTP (online transaction processing) applications, and their access patterns generally work best with data held in row-store formats.
Applications used for business analysis and reporting often summarize a few columns of data across many records, with queries sometimes running for hours. Here overall throughput is important, but latency for small reads is less so. These are called OLAP (online analytical processing) applications, and their access patterns generally work best with data held in column-store format.
The bigtable-like datastores, being a hybrid of row-store and column-store, can potentially work well for both use-cases. They do, however, have a range of other unusual features which might make them a very good choice, or a very bad choice, depending upon the use-case.