The Snowflake Data Warehouse

Categories: BigData

Overview

I recently saw some references to the Snowflake Data Warehouse. Snowflake’s detailed documentation is excellent, but I could not find a decent overview, ie what use-cases it is best suited for. A few hours of research were needed to discover the details; this article is the result.

Snowflake is a database which is available only “in the cloud” as a hosted service (“SaaS”). It currently is available only on the AWS cloud, but is not tightly coupled to Amazon and may well be available on other clouds (eg Azure, GCP) in the near future.

It is intended for “data warehousing” tasks, and is similar to GCP BigQuery or Hadoop Hive. It stores large amounts of tabular data efficiently, and runs analytics-type queries efficiently over large datasets.

It is not an “OLAP Engine”, ie “high level” data warehouse tool, and does not directly offer support for star schemas, cubes, etc. If you are interested in this kind of functionality, see Apache Kylin (though that requires Hive, and does not work with Snowflake).

It is also not an OLTP database, ie should not be used for regular updates and lookups of records by exact key.

I am not personally involved with the Snowflake product in any way, and am not currently a user of it - all the following information is just from basic research.

Most Significant Features

The most important attributes of Snowflake Data Warehouse at a glance:

  • Not a “full OLAP data warehouse solution”, just a database (similar to BigQuery/Hive)
  • Not open-source; cloud-only SaaS (pay for CPU usage by the hour and storage by data volume)
  • Cheap data storage (all data persisted in AWS S3, with heavy data cacheing on worker nodes for performance)
  • Nice control over performance (select size of worker VM pools)
  • Worker VMs can be destroyed when database not active (because data is stored in S3)
  • Cheap “clone” and “snapshot” operations (due to immutable underlying files in hive/hbase style)
  • Excellent support for “document-type data” (hierarchical/nested)
    • Data can be imported from JSON/AVRO/XML and similar structures into Snowflake without flattening
    • Also has interesting approach to handling types within such data (see later)
  • Near-standard SQL support - but with extensions for handling document-like data
  • All data saved in its own column-oriented storage format (similar to Parquet/ORC)
  • No primary key constraints or foreign key constraints (like BigQuery/Hive)
  • Inefficient single-record lookups, updates and deletes (similar to BigQuery/Hive; Snowflake is for OLAP not OLTP)
  • Reasonably low-latency queries (like BigQuery, HiveLLAP or Impala)

Internally, its architecture is somewhat like a cross between Hive, Hive-LLAP, and HBase. There is an excellent paper describing the implementation in reasonable detail.

Comparison with BigQuery

Snowflake addresses similar usecases to GCP BigQuery - storing large to extremely-large datasets (1TB - 1Petabyte) and doing statistics/reporting/machine-learning over large subsets of the dataset.

Costs approximately the same as GCP BigQuery for data storage and query execution. Of course exact pricing depends on usage patterns, but pricing is certainly not an order of magnitude different.

Performance is also roughly the same - though of course both have many “tweakable” parameters that trade off performance against price, so it is hard to make an exact comparison.

Both support “columns” holding array-typed or nested-record-typed values. However BigQuery always requires a strict schema for data, while Snowflake allows importing of data without a schema. Snowflakes’s ability to import and query data without explicit schemas is very interesting and the most innovative part of Snowflake.

Snowflake’s snapshot and clone operations are also interesting, and do not appear to have any equivalent in BigQuery. BigQuery does support “point in time snapshots” for data up to 7 days old via “table decorators” but that feature appears to only be available via the “legacy SQL” syntax - and is not nearly as flexible as Snowflake.

BigQuery has special support for “streaming writes”, ie for small data items to be frequently appended to an existing table. It is not currently clear how Snowflake handles such data.

BigQuery can read external files as well as data imported into its own tables; this is particularly useful for joining large datasets already in BigQuery with small datasets held externally (in GCP cloud storage, or on Google Drive). It appears that Snowflake only queries data that has already been imported into it (written in its own data format to S3).

BigQuery supports “explicit time-based partitioning” of tables, which offers two benefits:

  • Queries that specify the partitioning-timestamp column in the where-clause are more efficient (whole partitions can be skipped)
  • Data can be efficiently discarded when older than a threshold

Snowflake also partitions data to make querying efficient. However it does this automatically, without explicit declaration from the schema author. It is not clear how effective this can be - I would expect that partitioning is most effective when exactly one column is chosen, and query authors are aware that the column is a partitioning criteria. It certainly does not allow data to be efficiently discarded as BigQuery does.

BigQuery only charges for use when queries are actually being executed; there is no need to “stop a database” at any time. Snowflake does require some additional administration to minimise costs. However for high-priority jobs, it would seem that Snowflake “virtual warehouses” offer an easier way to guarantee priority.

Comparison to Hive

Snowflake addresses similar use-cases to Hive.

Due to the fact that data is persisted into effectively unlimited AWS S3 storage, Snowflake will scale far better than Hive. The ability to allocate “worker VM pools” of various sizes for query execution, and then to destroy those pools when they are not needed, together with the cheap cost of S3 storage, is likely to make Snowflake more cost-effective than Hive.

Once a Snowflake worker pool (“virtual warehouse”) is running, latency for queries is low (like HiveLLAP). Startup/warmup of such pools may take some time, ie the first few queries may be slow.

Snowflake is only available on AWS, ie is not an option for data that must be retained on-premise.

Hive supports very flexible partitioning of datasets, which must be explicitly defined on table creation.

  • Queries that specify the partitioning-timestamp column in the where-clause are more efficient (whole partitions can be skipped)
  • Specific partitions can be efficiently discarded.

Snowflake also partitions data to make querying efficient. However it does this automatically, without explicit declaration from the schema author. It is not clear how effective this can be - I would expect that partitioning is most effective when exactly one column is chosen, and query authors are aware that the column is a partitioning criteria. It certainly does not allow data to be efficiently discarded.

Snowflake Documentation

There is excellent user-level information available. The only thing really lacking is an overview - which this article hopefully provides.

There is an excellent academic paper on the Snowflake design. I wish all software producers would do something like this! Note that this paper is a little old now; some of the problems they point out with GCP BigQuery no longer exist (eg BigQuery now supports almost-standard SQL).

Standard SQL Syntax Support

The Snowflake team have gone to considerable effort to support ANSI-standard DDL (defining tables etc) and DML (queries, updates, etc). Scripts designed for other databases should generally work unchanged on Snowflake. However that doesn’t mean that such scripts have the same effect. As an example, tables can be defined with primary and foreign keys, and indexes. However such definitions are not enforced - a nicer way of saying they are ignored.

Similarly, selects specifying a full primary key are efficient in relational databases - but not so with Snowflake, as it is an OLAP-style engine and not an OLTP-style engine. Like BigQuery/Hive/etc, a “select” effectively triggers a full-table-scan (but with optimisations; see file format below).

Snowflake adds non-standard extensions to SQL in order to support querying document-like (ie non-tabular, or nested) datastructures.

File Format

Columnar storage similar to ORC/Parquet. Hive, BigQuery and similar tools do not keep “indices” of records. Snowflake also does not keep indices (although it does track some metadata due to its “micropartitioning” approach.

When indices are not availale, any select must be a “full table scan” - but various optimisations can still be applied:

  • Skipping of irrelevant partitions
  • Per-block metadata containing min/max values for columns
  • Per-block bloom filters (probably)

Datafiles are immutable:

  • Each modification produces a newer version of each affected file
  • Immutable files supports efficient snapshots (rewind) and clones

Datafiles are stored in “object storage” aka “blob storage”:

  • Currently only AWS S3 supported, but others may come later
  • Storage is thus cheap, cheap, cheap and very reliable.
  • Worker nodes can be shut down without losing data
  • File format is proprietary to snowflake, but similar to ORC/Parquet

For document-typed data, there is some cool auto-type-detect functionality. For each block of data which is imported without a schema (eg from JSON or XML), Snowflake scans all values of each column within the block to deduce the datatype (eg integer, timestamp or string). For each column, the data is then stored in its raw format (string) and its deduced type (eg compact integers). Queries such as sums over integers can then be effectively applied to such data efficiently (no repeated parsing). When a large document is imported (producing multiple blocks of storage) then the type can possibly be different for the same column in different blocks (though that is unlikely). See the Snowflake paper for details.

Virtual Warehouses (VWs) aka “worker node pools”

A Snowflake user (or admin) can start a “virtual warehouse” (pool of worker nodes) at any time. Queries (or updates) can then be executed on that “virtual warehouse”; worker nodes can access existing files within longterm S3 storage. Multiple virtual warehouses can access the same tables concurrently (via lock management system). Virtual warehouses are often started for specific tasks, and terminated when that task is complete.

As queries are executed, data is cached on locally-attached storage on worker nodes for efficient queries later. The “query optimiser” for each VW is aware of which nodes have which files from S3 cached, and direct queries that may affect that file’s contents to the worker node on which that data is already cached. Unbalanced cacheing of data is dealt with by “work stealing”: during a query, when one node is idle then it “takes over” cached files from busy nodes.

While a Snowflake Virtual Warehouse is based on a pool of VMs, this is mostly hidden from users; a user just “creates a VW” of one of the available sizes (small/medium/large); no low-level administration is required.

A VW runs only in one AWS availability zone, ie needs admin intervention in case of full zone outage.

While file cacheing works around the relative slowness of data retrieval from S3, it does mean that a VW has a “warm up period” during which queries execute at reduced speed.

To manage costs, VWs should be shut down when not needed, or at least reduced in size. This is somewhat more complex than BigQuery, where charges are only incurred when queries are actually executed.

Document Database Support

Snowflake supports schemaless (nonstructured) data.

Fields can be of type VARIANT for “schemaless” storage - VARIANT is a union of all other types.

When loading data without schema, for each column, each “block” of data is scanned to auto-determine the type (eg in a json array of objects, they are divided into blocks of 10K rows. Then for each unique fieldname, the type is determined. When a column with a specific name (eg “count”) frequently exists and always has type INT within that block, then column-oriented storage is used to store “count” separately from other fields, and with integer type.

The use of column-based storage gives the usual column-oriented benefits for queries. The use of a concrete type allows min/max and bloom-filters to be computed for later query pruning. Using a type also gives much more compressed storage (though snowflake also keeps the original string-format for safety).

Performance tests show “schemaless” data can be queried very nearly as fast as data with explicit schema.

Data Updates

Snowflake is “designed for analytic loads”, meaning data is most efficiently added via:

  • bulk or incremental appends
  • bulk updates

Snowflake does support ACID transactions, but this does not mean it is anything like a relational database (Oracle, MySQL, etc). Transactions are implemented via “snapshot isolation” on top of MVCC (no table locking). This generally means that when any two transactions touch the same data, then the first one to commit succeeds and the other fails on commit (optimistic locking).

The statement “touch the same data” might possibly mean “change different records in the same file” - need to test exactly how well Snowflake supports concurrent updates. This is not a common pattern for data warehouses, so it may well be that (despite the “acid transaction support”) concurrent updates from multiple applications should be avoided.

Additional minor stuff

Snowflake:

  • performs transparent online upgrades (without downtime)
  • supports custom UDFs in javascript
  • has its own internal “metadata store” similar to HCatalog
  • uses a key-value store for things like transaction logs.

SQL compilation/optimisation is always done server-side (hiveserver2-style, cassandra-style) rather than client-side (simple hive). In other words, Snowflake runs like a traditional “database server” rather than an “embedded sql compiler library”.