This is a quick introduction to the Google BigQuery service, part of the Google Cloud Platform (GCP).
Here’s the 10-second summary:
- rather similar to Hive (from the Hadoop toolset)
- stores data in tables with strict schemas (similar to relational)
- can hold massive datasets
- very SQL-like query syntax
- highly parallel query execution
- stores its data as files
- uses column-oriented storage (efficient for queries where just some fields are referenced)
- no primary keys or foreign key constraints
- no indexes
- supports table partitioning - but only by date (either insert-date or based on column)
- supports table sharding - if partitioning by something other than date is desired (see “wildcard queries”)
- a time-to-live can be set on partitions to automatically delete old records
- column types very limited: boolean, int64, float64, string, dates (multiple types), bytearray, record, array. Note in particular: no fixed-precision numeric types.
- updates and deletes require “rewriting” of tables ie are expensive and slow
- batch-load-jobs are asynchronous but cheap (max 1000 per day per table)
- streaming inserts are moderately expensive
Use Cases for BigQuery
BigQuery can store massive amounts of data - and does so very cheaply. It is particularly cost-effective when storing data that is rarely queried; you pay only for file-storage (and not for a running database server, as is the case for other database types).
It is good at running queries which read a significant percentage of the records in a table; this processing is done in parallel using many VMs concurrently. This is the common access-pattern for OLAP (analysis) tasks, report-generation, etc.
It is good at appending new records to tables - and can do so far faster than traditional relational databases (adding a record to a table is just a single append to a file on disk).
It is very bad at selecting single records by key (or indexed fields) - it effectively does a table scan (with some optimisations), which is not only slow but also expensive. Databases such as Bigtable, Datastore or relational systems are a better fit for fetch-by-key.
BigQuery does support update and delete in its SQL language, but these operations effectively copy the table, replacing records (update) and dropping records (delete). When the table is partitioned, and the update or delete statement includes the partition-key in its where-clause then only those partitions are copied, but in general this is still a poorly-performing operation. Updates and deletes are therefore best for “batch corrections” of the stored data, not for regularly modifying single records.
- good for OLAP-style analysis tasks
- good as a data-source for machine-learning algorithms
- reasonable for archival data storage (when plain files in google-cloud-storage are not sufficient)
- very bad for OLTP-style workloads where individual (or small related sets of records) are read and written
Data Storage in BigQuery
Note: If you are familiar with the architecture of Apache Hive, then you’ve got a good start on understanding BigQuery. Think Hive with ORC-like storage and compulsory “managed tables”.
A GCP project can hold multiple BigQuery datasets, and each dataset holds zero or more tables. These tables are roughly like tables in a relational database; the differences are discussed later.
Each table is internally stored as a set of files in cloud storage - and cloud storage is well known to be capable of holding vast amounts of data. These files are not directly accessible; the only access to the data in them is via BigQuery commands. Nevertheless, keeping the cloud-storage-based nature of data-storage in mind is useful for understanding some of the behaviour of BigQuery.
Tables can be partitioned by date, in which case the underlying cloud-storage has a separate “folder” per day, holding the files in that date-partition. When a query specifies the desired partitions in its where-clause then only the files in the relevant folders need to be read, with obvious performance benefits. In addition, deleting all data in a partition is efficient (see time-to-live later). The date on which partitioning is done can be a column from the table schema, or can be the date at which the row was inserted.
Tables can also be “sharded”. In effect, a set of tables with a common name-prefix are created, and “wildcard queries” can be used to apply the same SQL command to all matching tables. As with partitioning-by-date, each “shard” stores data in its own cloud-storage folder making queries that only affect some shards more efficient, and allowing easy deletion of specific shards. Sharded tables do not need to have exactly the same schema, but widely differing schemas will make it hard to write SQL operations that can be applied over multiple shards.
Appending a batch of new records to a table (eg via a load job) simply creates a new file. A background thread may merge small files into larger ones for storage efficiency. When streaming records into BigQuery, the records are simply appended to a file which “rolls over” based on time or maximum size; this file is called the streaming buffer. BigQuery does validate that the new records comply with the table schema (which is a fairly easy and fast test). Because BigQuery does not support primary keys or foreign-key-constraints, there are no performance-sapping tests that need to be done to verify consistency on insert. Similarly, because BigQuery does not support indexes, there are no additional files to update and no associated disk seeks need to be done. Appending data to a BigQuery table is therefore very fast - or in other words, the rate at which data can be written to BigQuery is very high.
Records are stored within files in Capacitor format, which is column oriented (and similar to ORC or Parquet formats used by Hadoop and Spark). This format makes it very efficient to execute queries against the table which only test a small subset of fields in the where-clause or only fetch a small subset of fields in the select-clause. Storing nulls is also very space-efficient. In fact, data storage in such formats (Capacitor, ORC, Parquet) is far more space-efficient than with traditional relational DBs.
BigQuery vs Relational
Comparing BigQuery features to relational databases:
- tables have a fixed schema like relational
- no primary keys
- no indexes
- the set of datatypes is quite limited: string, bytes (bytearray), bool, int64, float64, and several date-types. No
- additional datatypes
arrayavailable, with SQL syntax extensions to access data in such fields
The lack of primary keys is the most striking difference. If duplicate records should not exist in the dataset, then the applications which write data are responsible for avoiding that. However BigQuery is best used as an append-only database, in which case the lack of keys is less relevant.
The “append-mostly” nature of BigQuery means that data models are typically slightly different. Rather than overwriting records, consider appending new “change event” records to the table. The database thus becomes a historical record of events that have occurred, rather than a snapshot of the current state. Where such a “current state snapshot” is needed, it is worth considering computing this from the historical information, and then either storing that as a new table in BigQuery, or writing that state to a mutable database such as BigTable or a SQL database. This is sometimes called event sourcing; see also the lambda architecture and kappa architecture.
The lack of indexes means that queries are based on table scans (with some optimisations to skip over irrelevant blocks of data). This is actually quite efficient for the kinds of workloads that BigQuery is intended for - running queries that compute sum/min/max and other mathematical properties over large datasets, or joins of large sets of records. In short, table-scans are not bad for OLAP-style workloads. As noted earlier, the lack of indexes means that inserts into BigQuery are very fast. The use of table-scans as a base principal also makes it easier to parallelise operations - eg run a query on hundreds of nodes in parallel, each dealing with a subset of the table.
GCP has three different levels of geographic distribution:
- zone - a specific datacenter, eg “europe-west1-b”
- region - a set of datacenters, eg “europe-west1”
- multi-region (aka location) - a set of regions, eg “EU”
The BigQuery service is available only at the multi-region level. Each BigQuery dataset has an associated multi-region which controls where the data in the associated tables are stored. A BigQuery query is also executed in a multi-region; normally it is not necessary to specify this as the location will be determined from the location of the queried datasets.
It is not possible to specify that data is stored in a specific region, nor that BigQuery execution occurs in a specific region.
Open question: is a query that joins tables across multi-regions allowed? probably not..
The easiest way to get data into a BigQuery table is to write it to a file in Google cloud storage, and then to schedule a BigQuery load job. This will be executed as soon as there are some idle BigQuery worker nodes - which of course depends on how busy the Google datacenters are. In practice, a load-job starts executing within a minute or so. And load-jobs are free, regardless of how much data is loaded! Of course, you pay for storage once the data is in BigQuery, and you pay to query it.
There is a quota of 1000 load jobs per day per table, and 50,000 per project - quite generous.
There are quite strict constraints on the format of the data to be loaded. Formats CSV, JSON, and AVRO are supported.
- the file may have a header
- the field-separator is configurable, but must be a single character
- date and timestamp columns must have exactly the expected format
- the file must contain one JSON object per line (no array-start or array-end chars are expected or allowed)
- date and timestamp columns must have exactly the expected format
If the data to be loaded does not comply with the above formats, then either the data must be modified before load, or a dataflow program should be written to read lines from the file and write to BigQuery. A dataflow application can still submit the data as a load job.
Load jobs are atomic; either all of the data is loaded into the target table, or none of it.
As alternative to load jobs, data can be written to BigQuery in “real time” (streaming). As noted earlier, BigQuery can save data at a high rate. Data streamed into a table is appended to a set of temporary files called the streaming buffer; the records are later moved to their “proper” location.
Unlike load jobs, streaming writes are charged per megabyte - and the costs are significant.
When querying a date-partitioned table into which data is being streamed, data in the streaming buffers is only included in the query for dates 30 days in the past or 5 days in the future!
When executing a “batch” load job, the loading of data is atomic - either all data is inserted (and exactly once), or the job fails and no data is inserted. When inserting data via streaming, things are not so clear. Inserts are fairly reliable, but when there are network problems between client app and BigQuery, or server problems within the BigQuery system then the client (using BigQuery client libs) may auto-retry; because there are no primary keys in BigQuery this can cause the same records to be inserted twice. The client application sending records to BigQuery can reduce the probability of this by tagging each record with a unique “streaming id”; this value is saved into the streaming buffers within BigQuery but later discarded when the data is moved from the temporary streaming buffers to the real table partitions. BigQuery silently drops records if a record already exists within the streaming buffer with the same id; this means that retries which cause the same records to be inserted multiple times within a short time period do not cause duplicates - where the “short period” is however frequently BigQuery moves data out of the streaming buffers.
If the records being inserted have a “natural primary key” then the following query from the BigQuery documentation can remove duplicates - by copying the table and excluding the undesired records. This can only be done when streaming is not currently inserting into the table, and is inefficient - ie should be done infrequently or only when needed. Of course you need to replace
ID_COLUMN with the “logical key” of your table (criteria for which no duplicates should exist).
#standardSQL SELECT * EXCEPT(row_number) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) row_number FROM `TABLE_NAME`) WHERE row_number = 1
Accessing External Data Sources
BigQuery has some basic (primitive) support for querying data that has not been loaded into a BigQuery-managed table. However its support is not as extensive as Hive, where external data is a first-class citizen; the number of queries of external data per day is limited by a quota, and performance is poorer.
External data can be read from:
- cloud storage
- google drive spreadsheets
When reading from cloud storage, the requirements for the source file are the same as for a load-job: CSV, JSON or AVRO, with the constraints documented in the section above on load jobs, eg timestamps must be in exactly the right format. The data is not actually loaded into BigQuery, but each query counts towards the quota for load-jobs. Such a solution is therefore only appropriate for data which is accessed rarely, or just once before being transformed and stored natively in BigQuery.
One interesting additional file-format supported is a Google Datastore backup file. This is the only Bigquery/Datastore integration available.
The cloud-storage bucket in which external data is stored must be of type multi-regional, with a multi-region that matches the region in which the query is being executed; see the dropdown named “processing location” in the “show options” pane of the query window in the BigQuery web UI.
Querying Bigtable from BigQuery seems to require a rather odd syntax (I haven’t tried it myself).
One of the primary use-cases for external data sources is to read them once, and insert the data into a “proper” BigQuery table, ie to use the external data definition as part of an ETL process.
There is no way to query external relational databases from BigQuery.
Sharding and Partitioning
As noted earlier, tables can be partitioned by day - either based on a column in the schema, or on the date that a record was inserted.
Queries which use the partition-date in the where-clause are then far more efficient. Partitions can also be efficiently deleted.
BigQuery also offers a basic kind of partitioning based on non-date columns; this is better named “sharding”. In effect, each such shard is its own table, but the tables have a common name-prefix. A “template table” can be defined, allowing new “shards” to be automatically created when data is inserted.
Partition-based Time To Live
When a table is partitioned by date, then the table can have an associated “partition lifetime” setting. When a partition’s associated date is older than the lifetime then the entire partition is deleted automatically. For some kinds of datasets this can be very convenient.
Note that there is a limit on the number of partitions in a table; this effectively limits a date-partitioned table to around 7 years of data.
Standard and Legacy SQL
The original query language for BigQuery was a significantly modified variant of SQL; this is called “legacy SQL”. Later, BigQuery was updated to support nearly standards-compliant SQL (“standard SQL”).
Unfortunately, in the BigQuery web UI the “legacy SQL” syntax is still the default - meaning queries in the new (standard) syntax are rejected by default. The expected syntax can be changed by opening the “options” pane in the query screen of the UI, and unticking “legacy SQL”, but it soon gets annoying doing this repeatedly. It is easier to simply add
#standardSQL in the query-editor pane before the query itself. Example:
#standardSQL select ....
BigQuery pricing is simple; only three factors are relevant:
- the amount of data stored in BigQuery tables
- the amount of data streamed into BigQuery tables
- the amount of bytes read/written by executed queries
- the amount of inter-region network traffic
Data in BigQuery tables is actually stored as (hidden) files in Cloud Storage, and this storage is charged at exactly the same price as regular Cloud Storage usage. In addition, the file-format used by BigQuery is extremely compact/compressed. Data storage is therefore very good value.
Writing data into BigQuery using background “load jobs” is free (limited to 1000 jobs per table per day). Streaming (real-time) writes are charged per gigabyte and can be significant.
Each query is charged according to the number of bytes that the query needed to read from disk. Pricing is reasonable - see the documentation.
When queries are issued from within the same GCP region, then network traffic is free. For queries issued from outside the same GCP region, network traffic is charged at the usual rates.