This article looks briefly at the Google Cloud Platform component Cloud Dataprep, which is useful for cleaning and structuring data being imported into Google BigQuery. This is part of a series of articles on the Google Cloud Platform (GCP); start here for an overview.
Disclaimer: unlike most of the GCP technologies I have written about recently, I haven’t actually used Dataprep - I have just reviewed the documentation to see whether it could be useful as part of an architecture I am currently designing (the answer is probably no).
What Dataprep Does
Dataprep (or “Cloud Dataprep” to give its full name) assumes you have data in one or more sources (files in a supported format, BigQuery tables, etc). It then helps with:
- determining what the schema should look like for the output;
- extracting the relevant bits of data from each input record (and discarding the irrelevant bits);
- typecasting fields into suitable types (ie the types in the schema);
- clipping fields to suitable ranges of values;
- discarding weird values, or replacing them with default values;
- joining data from one table or file against the “matching” records from another source;
- and other typical “data preparation” tasks.
The above steps are applied to just a sample of the input data, for performance reasons. The steps taken to prepare the data are then used to generate a GCP Dataflow application which can be applied in a scalable way to the full dataset (this is supposed to be big data after all!). This dataflow takes the inputs, transforms them to address the above issues, and outputs the data (typically to BigQuery tables, or to files).
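As a rough illustration of that sample-then-scale pattern, here is a minimal pure-Python sketch. The transform functions, field names and data are all invented for illustration; in reality Dataprep compiles the recipe into a Cloud Dataflow job rather than running it like this:

```python
# Hypothetical sketch of the "design on a sample, run on everything" pattern.
# The transforms (clip_age, default_name) and the dataset are invented.

def clip_age(row):
    # Clip the age field into a plausible range.
    row["age"] = min(max(row["age"], 0), 120)
    return row

def default_name(row):
    # Replace a null name with a non-null default.
    if row["name"] is None:
        row["name"] = "unknown"
    return row

RECIPE = [clip_age, default_name]  # an ordered list of transform steps

def apply_recipe(rows, recipe):
    for row in rows:
        for step in recipe:
            row = step(row)
        yield row

full_dataset = [{"name": None, "age": 9999}, {"name": "bob", "age": 34}]

# Interactive phase: preview the recipe on a small sample only.
sample = full_dataset[:1]  # in Dataprep this would be a random sample
preview = list(apply_recipe(sample, RECIPE))

# Batch phase: the same recipe applied to the full dataset
# (in Dataprep's case, as a scalable Dataflow pipeline).
result = list(apply_recipe(full_dataset, RECIPE))
print(result)
```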
The process is interactive and graphical; the Dataprep user needs good business domain knowledge and reasonable data-science or data-modelling knowledge, but little knowledge of programming or the GCP platform.
Data Preparation Requirements
Often the input data has things like:
- nulls where you really want some non-null default value
- an integer column that is usually in some range (x,y) but where a few records have screwy values like -1 or 9999. Here, you might want to clip values to (x,y) - or just drop records with bad values.
- a column which is mostly integers, but with the occasional non-integer value.
- nested data (eg JSON with objects within objects) when you want a “flatter” relational-like view.
- data in multiple tables, when you want one joined table as output
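To make those problems concrete, here is a hedged sketch of how such clean-up rules might look if hand-coded; the field names, ranges and defaults are invented for illustration, and this is not how Dataprep implements anything:

```python
# Illustrative clean-up of one record: null defaults, typecasting,
# clipping, and flattening nested data. All field names are invented.

def clean(record, lo=0, hi=150):
    out = {}

    # Null handling: substitute a non-null default value.
    out["country"] = record.get("country") or "unknown"

    # Typecast + discard weird values: keep only parseable integers.
    try:
        age = int(record["age"])
    except (KeyError, ValueError, TypeError):
        age = None

    # Clip to the expected (lo, hi) range.
    if age is not None:
        age = max(lo, min(age, hi))
    out["age"] = age

    # Flatten nested data: promote address.city to a top-level column.
    out["city"] = (record.get("address") or {}).get("city")
    return out

print(clean({"country": None, "age": "9999", "address": {"city": "Zurich"}}))
```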
And often you don’t know much about your data, for example:
- how many distinct values does a column have?
- what is the (x,y) range in which 99% of the values for a column falls?
- which columns are actually irrelevant, and can be dropped?
- which columns are foreign keys referencing records from a different input file?
Dataprep Functionality in Detail
Dataprep was not actually created by Google; it is an application from Trifacta which has been integrated transparently into the GCP environment. It has been publicly available in beta since March 2017 (note that GCP services often stay in beta for very long periods of time). The user interface through which the user interacts with it is actually at https://clouddataprep.com/, not at a Google domain.
Dataprep provides a graphical drag-and-drop web interface for building dataflows which read from various sources and write to BigQuery or cloud-storage.
I won’t attempt to review the interface here, or the feature set, which is well documented; this article is just intended to indicate which projects might find it useful.
It is a fully hosted service, ie SaaS. There is no need to “create an instance” or create a set of VMs to run the product on; the user is simply charged when executing the transformations defined in the user interface (see pricing later).
Designing the necessary transformations (ie the dataflow) is done using an “example-based” approach, which can be easier to work with than a “logic-based” one: after each transform, you see a sample of the output it would produce.
Dataprep helps with joining records from different sources, eg suggesting columns from multiple tables that appear to have shared values, and thus potentially can be joined on. Theoretically a user can deal with related data without knowing the details of SQL joins.
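One plausible way to implement such a suggestion - purely my guess, not Dataprep’s documented algorithm - is to score column pairs by the overlap of their value sets, and propose the highest-scoring pair as the join key:

```python
# Toy "suggest joinable columns" heuristic: Jaccard overlap of value sets.
# Table contents and column names are invented for illustration.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

orders = {"order_id": [1, 2, 3], "cust": ["a7", "b9", "a7"]}
customers = {"id": ["a7", "b9", "c1"], "name": ["Ann", "Bob", "Cy"]}

# Score every column pair across the two tables.
scores = {
    (lcol, rcol): jaccard(lvals, rvals)
    for lcol, lvals in orders.items()
    for rcol, rvals in customers.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])
```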
And the resulting “recipe” for transforming input into output is “human readable” - English descriptions rather than code. This recipe can then be mapped to a Google Dataflow application and executed efficiently against very large input datasets.
Supported input formats include:
- BigQuery tables
- Files in Google Cloud Storage, in the following formats:
- CSV with single-char separators, no embedded linefeeds in columns
- linefeed-separated JSON objects
- Excel spreadsheets
Dataprep initially treats each input field as a string; it then uses heuristics to suggest suitable data-types and transformations for fields.
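A toy sketch of that kind of type-inference heuristic follows; the threshold and the ordering of candidate types are my invention, not Dataprep’s actual rules:

```python
# Guess a column's type from string values: propose the first candidate
# type that a large enough fraction of sampled values parse as.
# Threshold and candidate order are invented for illustration.

def guess_type(values, threshold=0.9):
    def parse_ratio(parse):
        ok = 0
        for v in values:
            try:
                parse(v)
                ok += 1
            except ValueError:
                pass
        return ok / len(values)

    for name, parse in [("integer", int), ("float", float)]:
        if parse_ratio(parse) >= threshold:
            return name
    return "string"  # fall back to leaving the field as-is

print(guess_type(["1", "2", "x", "4", "5", "6", "7", "8", "9", "10"]))
```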
The supported output formats, ie destinations that transformed input data can be written to, are pretty much the same as the input formats.
The most common combination is for the input-format to be files (in CSV or JSON format) and the output to be BigQuery tables. Further transformations, analysis and reporting can then be applied to the data in BigQuery.
Using the interactive web interface to inspect input data and define transformations to apply when importing the data into BigQuery is free. The generated “recipes” can be saved for later use.
When the “recipe” generated via the interface is actually applied to a full input file (rather than just the sampled subset that the UI uses), then (to quote the official docs) this is

> billed as a multiple of the execution cost (Cloud Dataprep Units) of the Cloud Dataflow job that performs the data
This multiplier is currently just 1.16, making Dataprep a very reasonable solution for one-off imports.
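For a concrete (hypothetical) example of what that multiplier means for a single run - the underlying Dataflow job cost here is an invented figure; real Dataflow pricing depends on the vCPU, memory and storage consumed:

```python
# Back-of-envelope cost sketch using the 1.16 multiplier quoted above.
DATAPREP_MULTIPLIER = 1.16

dataflow_job_cost_usd = 2.50  # hypothetical cost of the generated Dataflow job
dataprep_cost_usd = dataflow_job_cost_usd * DATAPREP_MULTIPLIER

print(f"Dataprep run: ${dataprep_cost_usd:.2f}")  # 16% on top of the raw job
```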
Data preparation can also be done in Google Data Studio.
The generated dataflow gathers data statistics as it runs, and reports them back to Dataprep; the “job results” page in Dataprep shows this info as nice graphs/histograms.
Summary and Opinion
As a programmer, I am often less than impressed by graphical tools that promise to achieve tasks “without programming”. There are several issues with such tools:
- the user does not need to learn general-purpose programming - but does need to learn this specific tool;
- the results usually cannot be version-controlled, reviewed, or “rolled out” to QA and production environments;
- automated testing is often not possible;
- error-handling and dealing with unusual cases is often ignored;
- and various other issues.
In the case of Dataprep, some of the steps demonstrated in the video to extract sub-fields of text strings felt ugly and fragile - a properly thought-through solution written as standard code would be more robust. It is also not clear from the documentation how versioning and reviewing of the solution, and migrating the solution between environments, are to be done.
However these concerns might not be critical for some use-cases; it depends upon the task. In particular, one-off imports may well benefit from the Dataprep approach. In other cases, it might be helpful for a domain expert to define an initial import, and for this to then be reimplemented as code for production use. That would also reduce concerns about the costs of using Dataprep-generated dataflow applications long-term.
The way that Dataprep works on a sample of data, and provides very quick feedback on transformations of that sample, does look very helpful. The pricing also seems very reasonable.
I will definitely discuss use of Dataprep with my data-science colleagues next time a complex data import needs to be implemented - whether one-off or a regular production task.