Storage Space Efficiency in Avro and HBase

Categories: BigData

HBase with Long Column Names

There are many sources which state that HBase columns “should have short names”, eg the HBase documentation or an IBM article. The reason is obvious when you look at the way that HBase stores data internally.

But exactly how bad is the impact of long column names, particularly when the data is compressed with the Snappy algorithm? Dennis B. and I investigated.

This article was inspired in part by an interesting article by Baranowski, who compared speed and disk usage for various storage formats including HBase, Avro, Parquet, and Kudu. However, he did not measure the effect of short vs long column names in HBase.

Measuring Space Usage

Simple Python programs were written to write data to HBase and to Avro with various settings. Links to the source code are provided below.

For each test, a schema was created with a varying number of columns. Each column was given a random alphanumeric name, whose length varied by test. A large number of rows were then written, with the value in each column being a string of around 20 chars, chosen to be not very compressible.

The use of random strings for column names was to ensure that snappy could not heavily compress the names. Deliberately difficult-to-compress strings were used for column values so that the values did not compress away to almost nothing, which would have made the column-name overhead look unrealistically dominant.

The rowkey for each row is a string “row-{N}”.
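For illustration, here is a minimal sketch of the data-generation approach (the helper names are hypothetical; the actual scripts are linked at the end of this article). The value format is the one described in the Avro section below:

    import random
    import string

    def random_name(length):
        # Random alphanumeric name, so snappy cannot compress column names well
        return "".join(random.choices(string.ascii_letters + string.digits, k=length))

    def make_value(colnum, rownum):
        # Deliberately hard-to-compress value of form "colnum.rownum.abc.rownum.colnum",
        # around 20 chars on average
        return f"{colnum}.{rownum}.abc.{rownum}.{colnum}"

    ncols, name_len, nrows = 100, 10, 100_000
    colnames = [random_name(name_len) for _ in range(ncols)]

    for rownum in range(nrows):
        rowkey = f"row-{rownum}"
        row = {colnames[c]: make_value(c, rownum) for c in range(ncols)}
        # ... write (rowkey, row) to HBase or to an Avro file here ...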

HBase Storage Format

An HBase “table” consists of a set of HFile-format files.

Each hfile is essentially a sequential list of (rowkey, column-name, version, value) entries, one for each “cell” (ie for each column of each row). Metadata, bloom filters, min/max column values, etc are also present in the hfiles, but the space they require is statistically insignificant.

This format means that the rowkey and column-name are repeated once for each column value - unlike traditional row-oriented databases, where column names are stored just once, in the table schema.

This repetition is what makes the “schemaless” feature of HBase possible; a new column can be added at any time. However this also potentially leads to a less space-efficient encoding on disk - exactly the issue this article sets out to measure.
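As a rough model of the per-cell cost (a simplification: real HFile cells also include the column family, length prefixes, and a type byte), the figures used in the conclusions below can be estimated like this:

    def cell_size(rowkey_len, colname_len, value_len):
        # Simplified HFile cell cost: the rowkey and column name are repeated
        # in every cell, plus an 8-byte version (timestamp) and the value itself
        return rowkey_len + colname_len + 8 + value_len

    print(cell_size(10, 10, 20))    # 48 bytes per cell with 10-char column names
    print(cell_size(10, 100, 20))   # 138 bytes per cell with 100-char column names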

Storing Data with Avro

As a comparison to HBase, data was stored in Avro binary format, with and without compression. An Avro container file includes a copy of the schema just once, at the front; each row is then stored in a fairly compact binary encoding. For reasonably large numbers of rows, the size of the schema is therefore not significant.

Note that the colnum and rownum components of each value (the value format is described below) are not zero-padded, as long runs of zeroes would presumably make snappy look even better than it really is.
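As a sketch of how such a container file can be written - assuming the fastavro library here, which the original scripts may not use - reusing random_name and make_value from the earlier sketch:

    from fastavro import parse_schema, writer

    ncols, nrows = 100, 100_000
    # Avro field names must start with a letter, so prefix the random part
    colnames = ["c" + random_name(99) for _ in range(ncols)]

    schema = parse_schema({
        "type": "record",
        "name": "TestRow",
        "fields": [{"name": c, "type": "string"} for c in colnames],
    })

    def records():
        for rownum in range(nrows):
            yield {colnames[c]: make_value(c, rownum) for c in range(ncols)}

    with open("test.avro", "wb") as out:
        # codec="snappy" requires the python-snappy package; use "null" for no compression
        writer(out, schema, records(), codec="snappy")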

A test writing 100k rows with 100 cols and 20 chars per value produced the following results. This test is comparable to the HBase results for 100 cols and various column-name-lengths (column names in the Avro test had 100 chars, but that is not relevant for Avro, as the schema is only written once).

  • without compression: 206MB
  • with snappy compression: 59MB

A test with 1000 columns produced the following results:

  • without compression: 2.2GB (ie approx 10x the 100-column requirement, as would be expected)
  • with snappy compression: 687MB

Just as a comparison, the theoretical optimal storage space without compression should be just the space needed for the column data. This was calculated simply as nrows * ncols * (sizeof(colvalue) + 2), on the assumption that value-strings are stored uncompressed with a 2-byte “field length”, and that the space needed to store the schema is not significant. The colvalue isn’t always of a fixed size, as it is of form “colnum.rownum.abc.rownum.colnum” where colnum is between 1 and 4 chars and rownum between 1 and 6 chars. However, on average, the value is around 20 chars.

The optimal storage for 100k rows times 100 columns times (approx) 22 bytes per column value is 209MB. The measured “without compression” result of 206MB is almost identical, ie Avro does store data extremely efficiently.
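The arithmetic, for reference:

    nrows, ncols, value_len = 100_000, 100, 20
    optimal = nrows * ncols * (value_len + 2)  # 2-byte length prefix assumed per value
    print(optimal / 2**20)                     # approx 209.8, ie ~209MB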

Just for sanity, I confirmed that setting the column-name length to only 10 chars makes almost no difference: only the schema at the front of the file changes size, not the per-row values. A test with 100 columns of 1000-char names was also nearly identical.

Interestingly, although the value-strings were chosen as “reasonably hard to compress”, snappy did a good job, reducing the file to just 29% of its original uncompressed size.

Storing in HBase
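The tables were populated via a simple Python client. A minimal sketch using the happybase library (an assumption on my part - the actual scripts are linked at the end), again reusing colnames and make_value from the first sketch; for the snappy tests, the table is created with compression enabled on its single column family:

    import happybase

    connection = happybase.Connection("localhost")  # assumes a local HBase Thrift gateway

    # Omit the compression option to create an uncompressed table
    connection.create_table("testtable", {"f": dict(compression="SNAPPY")})

    table = connection.table("testtable")
    with table.batch(batch_size=1000) as batch:
        for rownum in range(100_000):
            row = {f"f:{colnames[c]}".encode(): make_value(c, rownum).encode()
                   for c in range(len(colnames))}
            batch.put(f"row-{rownum}".encode(), row)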

Here are the results of populating HBase tables with various numbers of columns having various column-name lengths:

  • 10 columns with 10-char names
    • uncompressed size: 60M
    • snappy-compressed size: 12M
  • 10 columns with 100-char names
    • uncompressed size: 146M (2.4 times space for 10x10)
    • snappy-compressed size: 16.9M
  • 100 columns with 10-char names
    • uncompressed size: 615M
    • snappy-compressed size: 125M
  • 100 columns with 100-char names
    • uncompressed size: 1.4GB (2.2 times the space for 100x10; see the conclusions for why this is less than the expected ~2.9x)
    • snappy-compressed size: 368M
  • 1000 columns with 10-char names
    • uncompressed size: 6.2G
    • snappy-compressed size: 2.1G
  • 1000 columns with 100-char names
    • uncompressed size: 14.6G (no surprise, 10 times more than 100x100)
    • snappy-compressed size: 10.3G (surprisingly little savings compared to uncompressed)

Interestingly:

  • moving from 100x10 to 100x100 only slightly more than doubled the HBase storage required (615MB vs 1.4GB)
  • moving from 100x10+snappy to 100x100+snappy approximately triples the storage required (125MB vs 368MB).
  • moving from non-snappy to snappy typically reduces size to 10-35% of the original (the 1000x100 test being a notable exception at around 70%).
  • moving from HBase to Avro reduces size to around 30% of the original based on 100x10-char colnames (3x better), 15% (7x better) based on 100x100-char names, and even better for more/longer names.
  • moving from HBase+snappy to Avro+snappy reduces size to around 50% for 100x10-char colnames (2x better), and 16% (6x better) based on 100x100-char names.

Data sizes for the resulting tables were measured with a simple “hdfs dfs -du -h ..” on the base path for the table within HDFS. Disk space was measured before and after ‘disable {table}’ to ensure all data had been migrated from WALs to table-files, but that made little difference.

Conclusions

The conclusions for compression are:

  • For storing large amounts of data in Avro, definitely turn on Snappy
  • For large amounts of data in HBase which is seldom read, definitely turn on snappy
  • For large amounts of data in HBase which is regularly read, measure - you probably want to turn on snappy

The conclusion for Avro vs HBase is:

  • Avro storage is much more efficient than HBase (3 to 10 times better, depending on column name lengths)

The conclusion for column-name lengths in HBase is:

  • Short column-names are clearly more space-efficient than long ones, though the penalty for long names is not quite as large as might be expected.

The impact of using long column names in HBase is not quite as serious as logic would imply. Logically, it would be expected that moving each cell from (10-char-rowkey, 10-char-name, 8-byte version, 20-char-value) to (10-char-rowkey, 100-char-name, 8-byte version, 20-char-value) would increase storage by a factor of 138/48 ≈ 2.9. Instead, we see only a 2.2x increase when snappy is not in use (1.4GB/615MB) and a 3x increase (368MB/125MB) when snappy is used. We are not completely clear why the uncompressed impact is somewhat less than expected. Results of snappy compression were also inconsistent between the 100-column and 1000-column tests (3.8x compression vs 1.4x compression).

Nevertheless, for large datasets, short column names (and family names) are indeed recommended - doubling or tripling data storage costs just for more descriptive column names is usually not worth it. The extra storage space will also increase IO times during query execution.

The actual ratio will depend upon the data in each column; when columns hold large data values (eg large binary arrays or long strings) then the cost of longer column names is relatively less. When columns hold numeric or boolean values, the cost of longer column names will be relatively larger.

And if you can store in Avro rather than HBase, that seems well worthwhile (3-10 times less space, depending on number and size of columns). And with Avro you can use whatever column names you desire, as the names are saved only once per file!

In a large cluster, having to buy only 1/3 as many physical disks (and corresponding servers to host them) is a truly significant saving.

Source

The source can be found here.