Categories: Cloud, BigData, Programming
Introduction
Many databases which support SQL offer “analytic functions” (also known as “analytical functions”). I recently needed to perform some queries (on Google BigQuery) that regular SQL just couldn’t handle, so had to brush up on my knowledge in this area - and here are my (prettified) notes.
The Different Kinds of SQL Function
Normal Functions
Normal SQL functions apply independently to each row, without any inter-row interactions; an example is to_upper(fname)
whose output is never affected by values in other rows.
Aggregate Functions
Aggregate functions combine values from multiple rows and produce just one row of output. An example is:
select sum(x), max(x) from data
Aggregate functions are often used with groups. Grouping results in one output row per group:
select category, sum(x), max(x) from data group by category
The only “output columns” allowed in a grouped query are:
- columns used in the group-criteria
- the results of aggregate functions
When no group is specified for an aggregate function then the entire table is considered to be a single group and (as aggregates generate one output row per group) the output is a single row.
Analytic Functions
Analytic functions combine values from multiple rows (like aggregates), but produce an output for each row (like normal functions).
The set of rows over which values are combined is specified by a window frame clause. For each record, the window frame clause is applied to that record to determine the pane (set of records) over which the analytic function will be applied. The function is then applied, and the result output.
The window frame clause may also specify an order by clause (and usually does). This does not affect the order in which rows are returned from the select-statement, but instead affects which records are in each pane.
Analytic functions are often used with partitions. Partitioning effectively splits a table into a set of subtables. The pane computed for each record is clipped to the boundaries of the partition, ie a pane only includes records in the same partition. The process of partitioning is somewhat like the first pass of grouping, but without the following “roll-up” that generates just one output record per group.
When a query does not specify a window frame clause at all then the pane for each record is the entire partition. And when a partitioning clause is not specified, then one partition exists which contains the entire set of records selected by the where-clause. It is therefore valid to use an analytic function without an (explicit) window frame clause or partition - ie to write just “somefn(..) over ()
”.
The Over Clause
Each analytic function used in a select-statement must be followed by “over (..)
” to indicate that the analytic version of that function is wanted. This is necessary as there are some analytic functions with the same name as an aggregate function; without the over
the aggregate version will be used - and the SQL will be rejected if no group by
is present.
The form “over ()
” (with empty parentheses) can be used if the default value for the window clause is sufficient (see later).
While all aggregate functions in a statement share the same grouping condition, each analytic function in a statement has its own over-clause. As it is common for multiple analytic functions in the same statement to use the same window clause, BigQuery has a syntax to allow the window clause to be defined once within the SQL statement and referenced by name with “over somewindowname
” rather than “over (..)
”.
The row_number
Function
Probably the simplest of all analytic functions is row_number()
which simply generates increasing integer values for each row. When applied to a sorted set of records, this makes it easy to then select the first N records.
There is one slight problem: BigQuery does not allow analytic functions in where-clauses, nor does it allow referring to computed columns from where-clauses. This statement therefore does not work:
select id, row_number() over (order by score) as rn from data where rn < 3;
but this one does:
with
results as (select id, row_number() over (order by score) as rn from data)
select * from results where rn < 3;
There are a few details in the above query that are discussed later, but hopefully the purpose of row_number()
is reasonably clear.
Window Frames
A window frame clause is an expression which is applied to “the record currently being evaluated”, to determine which other records in the current partition should be included as input to the analytic function. The resulting set of records is called a pane.
Some examples of window-frame-clauses (from the BigQuery docs):
-
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
– (from start of partition to row being evaluated) -
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
– (all records in partition) ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
ROWS 10 PRECEDING -- equivalent to BETWEEN 10 PRECEDING AND CURRENT ROW
Any record offsets for the pane are clipped to the partition; for example when “ROWS 10 PRECEDING” is applied to the first record in a partition the pane includes just the current record, and when applied to the second record in a partition it includes just the first record and the current record.
When no window frame clause is specified:
- when there is no order-by clause then the pane for each record consists of the whole partition
- when there is an order-by clause then the pane for each record consists of all records up to the current row
It is quite common for a query using analytic functions to use the default window frame.
An example of the use of order-by is:
select id, sum(value) over () as totalSum, sum(value) over (order by id) as runningSum from data
in which totalSum is the same for each record, while runningSum is not.
The keyword RANGE can be used instead of ROWS, in which case the pane (set of records) is determined using the value of the order-by column +- the offset. Example:
select .. (partition by somecol order by myval RANGE between 2 preceding and 2 following) ...
For each record in the above query, the pane is the set of records where myval
is in range (curr-record.myval - 2, curr-record.myval + 2)
. Because the data is sorted by myval
, the set of records in the pane are all adjacent - but the number of records in that set may not always be the same.
Reminder: when no partition is explicitly defined, then the whole set of selected records is a single partition.
Partitions
Here is an example of applying the analytic function row_number()
to a partitioned dataset:
select row_number() over (partition by league order by score) as rn from data where ...
The partition-by clause actually takes a list of expressions; a column-name is a trivial expression. In BigQuery, the expression-types are not allowed to return a value of type struct, array or float.
The result of partitioning is a set of partitions (subtables) where evaluating the expression(s) on all records within a subtable returns the same result. In the simple case of a list-of-column-names, it means that all records in a partition have exactly the same value for the specified columns.
Within the subtables generated by the partioning, the data can optionally be sorted:
select fname, lname, someAnalyticFn(..) over (partition by to_upper(fname), to_upper(lname) order by fname asc, lname asc)
When no partitioning is specified (select .. over ()
) then the table is treated as one single partition. There can still be an order-by clause in the over (..)
without a partition-by criteria; this simply sorts the (single) partition.
Windows
The term window or window clause simply refers to a combination of partition-specification and window-frame-specification.
Some Common Analytic Functions
The most commonly-used analytic functions include:
-
row_number()
– returns an index of the row within the pane (often used with pane=partition) -
rank()
- similar torow_number
, except when two rows have the same value for the order-by criteria. In this case,row_number
arbitrarily assigns the row-number, whilerank
works like a sport competition : first, second, second, fourth. -
lead(expr)
- returns the value from some other record in the pane relative to the current row; useful for calculating diffs between adjacent records. -
first_value(expr)
- returns the value of expr when applied to the first record in the pane. This allows grabbing data from some other record and inserting it into the output for the current row - eg computing the diff between each record and the first record in the corresponding pane.
And examples of well-known functions that also work in analytic context (ie in a pane)
-
count(*)
- the number of records in the pane associated with the current record -
sum(expr)
- as for groups, but for each record the output sum is computed only over the set of rows in the pane associated with that record. -
min(expr)/max(expr)
- as for groups, but limited to the set of rows in the pane associated with the current record. -
avg(expr)
- as for groups, but limited to the set of rows in the pane associated with the current record
Note that some functions take zero parameters (they operate on “the whole record” rather than a specific column or derived value) - but the parentheses are still required.
Conclusion
Given the above info, this quote from the Google BigQuery docs may now be enlightening:
In databases, an analytic function is a function that computes aggregate values over a group of rows. Unlike aggregate functions, which return a single aggregate value for a group of rows, analytic functions return a single value for each row by computing the function over a group of input rows.
Analytic functions are a powerful mechanism for succinctly representing complex analytic operations, and they enable efficient evaluations that otherwise would involve expensive self-JOINs or computation outside the SQL query.
Appendix: Testing Queries with BigQuery
This section is not really about analytic functions; however it shows a useful technique for trying out some of the analytic functions described above.
To test any SQL statement, it is necessary to first insert some test data into a test table. This is somewhat more complicated with BigQuery than traditional relational DBs, as “insert” is a fairly heavy-weight command - and has a quota of 1000 operations per day.
The unnest
operator can instead be used to fake
a table for tests:
#standardSQL
with data as (select * from unnest(array<struct<id string, value1 int64, value2 int64>>[
("row0.f1", 12, 112),
("row1.f1", 13, 113),
("row2.f1", 14, 114)
]))
select * from data