Categories: Programming, BigData
Overview
This article gives an overview of the core concepts of the Data Vault method for building a data warehouse.
This is an extension to my earlier article on data warehousing. You might also find my article on dimensional modelling helpful.
This article assumes you (the reader) are familiar with relational databases, entity-relationship modelling, SQL, reporting tools, computer networks, and client/server architectures.
Information Sources
This article is primarily a summary of the content of the official book on the Data Vault method:
Building a Scalable Data Warehouse with Data Vault 2.0, Linstedt, Olschimke, publisher: Morgan Kaufmann 2015 (649 pages)
Not only is that book hundreds of pages long, but the bibliography also contains hundreds of references to further sources - reading all these references would be a year’s work. This article is just one (long) page; it should therefore be clear that a lot of detail has been left out. This article is intended to give only an overview, to let you decide whether learning more about Data Vault is relevant for you. It is also my personal notes on the subject, in case I need to refer to them later…
In short: data warehousing is a complex business; the following info just sets the scene.
The Data Vault book is not as well written as Kimball’s book on data warehousing (though that is a high bar). In particular, the first “overview” chapter is a bit waffly. The DV book also has poor formatting (this appears to be a common problem with books from the Morgan Kaufmann publishing house), and includes many examples (and screenshots) based on Microsoft products from the year 2014 - now rather dated, and a little annoying if you are not intending to use Microsoft products. In particular, chapters 11 and 12 are half full of MS product screenshots. Yecch. However the content is worth battling through the presentation issues for.
I would recommend reading both the Kimball and Data Vault books - the Kimball book for its excellent presentation of the issues of data warehousing, and of dimensional modelling, and the Data Vault book for its very interesting and potentially superior approach.
Why Data Vault?
Data warehousing has a long history; companies have been interested in centralizing and reporting over “all their data” for a long time. Two different design philosophies have been dominant for a long time: the Inmon approach (aka CIF or DW2.0) and the Kimball approach. Data Vault is a relatively recent alternative, but has become quite popular.
The Data Vault approach is defined in a set of books by its inventor, Dan Linstedt. The Data Vault books describe “the data vault warehousing approach” which consists of:
- Data Vault Methodology
- Data Vault Modeling
- Data Vault Architecture
- Data Vault Implementation
The Kimball DWH methodology also addresses all of these areas - ie neither methodology is limited to just a data model (representing entities and relationships).
The name “Data Vault” is trademarked, and may be used only with permission. Anything that calls itself “Data Vault” is therefore an “official” product - unlike “dimensional modelling” for example, which is just a descriptive term. However Data Vault is a design pattern for data warehouses; it isn’t necessary to buy any specific products or tools to use the concepts.
Like Kimball and Inmon, Data Vault recommends that end users be provided with “data marts”, each of which contains a subset of the available data, selected for their specific interests and represented using dimensional modelling principles to make it easier to write reports. Unlike Kimball (and like Inmon), Data Vault recommends an extra data tier.
In a Kimball DWH, data comes from the upstream source into an ETL processing phase (which Kimball calls the “back room”) and is then written into a data mart in dimensional form. There may be storage within the ETL/backroom area, but this is only temporary “working” storage; long-term data persistence is done in the data marts. The overall DWH is the union of all data marts.
The Data Vault approach differs from the Kimball approach by adding an extra layer:
- upstream data is processed via ETL logic which might involve temporary tables (as with Kimball)
- data is then written into a master database in “data vault format” called the Raw Vault
- and then data is exported from the Raw Vault into data marts
In a Kimball DWH, the data marts are the master storage, and losing one is a serious problem. In Data Vault (and the Inmon approaches), the data marts are not the master data storage, and can in fact be discarded and regenerated if necessary.
Note that in both cases, users see just the data marts, and these are based on dimensional modelling principles to make life easy for users who write reports against the available data.
Having a central DB of course increases the amount of storage needed - data is stored both in the central store and one (or more) data marts (but see later for info on “virtual data marts”). It also means that a database technology is needed which scales to “all the data” - unless the central store is sharded across multiple DBs which makes things somewhat more complex.
However having a “master store” which does not need to be in “easily queryable form” provides a number of advantages. The exact form chosen by Data Vault (hubs, satellites and links) solves a lot of common DWH problems related to:
- handling invalid data in the input
- handling multiple upstream sources of data that are related to “the same entity”
- handling schema changes in upstream sources
Inmon appears to recommend a similar “mid-layer” called the Operational Data Store, holding data in 3rd normal relational form. However Inmon’s books and articles are so vague it is not entirely clear how this central data store works. The Data Vault books are very clear and comprehensive on this topic.
The remainder of this article looks at what this intermediate form (hubs/satellites/links) is, what advantages it brings, and what complexity it adds.
Data Vault 2.0
The original Data Vault approach (documented in articles, books, and taught in courses) has been updated - the latest recommendations are defined as Data Vault 2.0. The primary change is a switch to using “hash keys” instead of traditional surrogate keys; this allows more parallelisation in the ETL phase - which is in turn important for very large datasets. I am not sure what other changes exist between DV1 and DV2. This article discusses DV2.0 only.
Data Vault Architecture
The overall architecture of a Data Vault-based data warehouse consists of:
- a staging area which does not store historical info and does not transform the input data, apart from basic type-mappings
- a warehouse layer in data vault model format
- one or more data marts
- an optional operational vault - a vault-format database within the DWH to which external processes are granted read/write access
- a “managed self-service BI with ability to write to the data warehouse” - not clear to me what this means
The warehouse layer consists of:
- a raw vault – the core data store
- a metrics vault – information about DWH-internal processes such as ETL
- a business vault – cache of post-processed raw data “after business rules have been applied”
- a meta mart – metadata about the DWH
- an error mart – holds information about ETL processing errors
where things called “vault” are in data-vault (hub/satellite/link) format.
The Data Vault architecture works fine with NoSQL systems as inputs. The book claims that NoSQL can also be used for the vaults in the core DWH - but the book does not provide any details (“out of scope”) and assumes a traditional relational DB.
Splitting the transform from source to mart into two phases (ie introducing the raw vault in the middle) allows data transforms to be split into two categories:
- normalisation applied during initial import into raw vault (“hard business rules”)
- normalisation applied during export to data mart (“soft business rules”)
This means less data needs to be rejected in the first phase. Data that cannot be imported at all is tricky to deal with. It is also possible to change “soft rules” later and re-export data; changes to hard rules can be more difficult to deal with.
The staging area is generally expected to be a relational DB whose tables mirror the schemas of the sources of incoming data, but with all columns nullable and without foreign keys/indexes to enable loading of malformed data. The tables are expected to be cleared after each successful ETL run. Alternatively, data can be imported directly from files into the raw vault without being loaded into any intermediate database. The Data Vault book describes the staging area in more detail.
The Raw Data Vault stores data in three kinds of structures: hubs, links, and satellites. These can be relational tables, but other options are possible. In general, for each “data type” represented in the incoming data (eg customer) there is:
- a hub table for that data-type containing one significant column: the “business key”. Each instance of that data-type has one row in the corresponding hub table
- link tables which join hubs together, ie represent relations between entities. A link refers to two or more hubs.
- one or more satellite tables; each table provides a set of attributes associated with a hub entity or a link entity.
This representation is covered in more detail below.
The raw data vault contents (like all “vaults”) are not expected to be accessed by users.
A Data Vault DWH also contains an Error Mart and Meta Mart. Errors that occur during data processing are written into the Error Mart, and end users are permitted to query the Error Mart to do analysis on data quality and other issues. The Meta Mart contains “data catalog” type information about the contents of the raw data vault, information about ETL processes, backup logs, parameters for “business rules” applied to data, etc.
The DWH optionally includes a Metrics Vault which ETL processes write statistics to. If it exists, there may also be a corresponding Metrics Mart which end users are permitted to query for information on ETL performance.
The DWH optionally includes a Business Vault which ETL populates with data that results from applying “business rules” to the contents of the Raw Data Vault. This is then used as an aid when exporting data to a Data Mart - ie the Business Vault cleanly separates derived data from raw data. A Business Vault can be recomputed from the Raw Data Vault when needed. In general, transformations are applied as part of the export to a data mart; however when a transformation is complex and is needed for more than one data mart it should instead be applied earlier with the results stored (cached) in the Business Vault.
Datastores named “Vault” are in “data vault model” form (hub/satellite/link) and are not intended for use by business users. Datastores named “Mart” use dimensional-modelling, multidimensional modelling (OLAP cube), 3rd normal form modelling, or some other form which is suitable for business users. The various “Vaults” are usually within the same physical database. Each “Mart” may be in the same physical database or be in a physically separate system.
While business users never access “Vault” datastores, it can sometimes be useful to allow external applications to access them. Such applications must be able to handle data in data-vault-model layout (hub/satellite/link). In particular, an external Master Data Management system may be set up to replicate data directly into the data vault, bypassing the staging area (as long as that system is data-vault-model-compliant). User-facing systems that display historical data might also be implemented via direct access. The fact that incompatible changes to a schema for a data-type are handled by creating a new satellite rather than modifying an existing one makes this more stable. Real-time ETL (rather than batch-mode) may also write directly into the data vault model, bypassing staging. The subset of the raw data vault that is externally accessible is referred to as the Operational Vault.
Ideally, users draw data from a Data Mart. However it takes time to set up a data mart and export the appropriate data into it; it is therefore considered valid to allow users to query the Raw Data Vault directly. Such users will need to be able to cope with the Data Vault model, and handle the fact that business rules have not been applied to the data (though data in the Business Vault may help) - or wait for a more user-friendly Data Mart customised for their needs to be built. This approach should be considered more as allowing users to do their own ETL from the vault into their own data-mart than as running analysis on the vault contents - the vault structure is not suitable for analysis.
The methodology section of the Data Vault book (relating to management of the DWH development process) is a bit thin and vague, and offers no original insights. However that may be partly because I come from a software development background, where many of the principles of agile development have long been considered standard practice - it might be different for teams used to developing BI solutions.
The Data Vault book assumes the use of Microsoft database products. It also assumes that the Raw Data Vault fits onto a single SqlServer instance (with multiple disks). There are assertions in the book that “this design can scale”, but no details. It also assumes that traditional database backup strategies can be used. IMO, these assumptions are not valid for modern businesses - data no longer fits on a single server in most cases. Note that it does assume “data marts” (aka information marts) are physically separate.
Data Vault Modelling
Overview
This section looks specifically at how the raw vault (or any other vault) actually stores the imported data.
A Data Vault stores data in three kinds of structures: hubs, links, and satellites. These can be relational tables, but other options are possible. In general, for each “data type” represented in the incoming data (eg customer) there is:
- a hub table for that data-type containing one significant column: the “business key”. Each instance of that data-type has one row in the corresponding hub table.
- link tables joining the hub to other hubs, ie representing relations between entities. A link refers to two or more hubs.
- one or more satellite tables; each table provides a set of attributes associated with a hub entity or a link entity.
There is a separate satellite table for each data-source that provides information about instances of that hub or link. If a source changes its data-format in a non-backwards-compatible way, then that also results in a new satellite table. Each satellite table has appropriate columns to hold data coming from its upstream source - plus a foreign key to the corresponding hub or link entry.
This data model makes it very easy to import data about the same entity from multiple upstream sources - there are no “column conflicts” because each source has a table customised to hold whatever data that source provides. The hub entry ties together data from different sources related to the same instance of a particular type (stored in satellites). Because the data has very few transformations applied to it (just char-set conversions, date conversions, etc) there is no danger of data-loss due to incorrect (or changing) transformation rules. It is also a representation suitable for auditing.
Interestingly, this data model looks something like a graph database, with nodes (hubs), edges (links) and attributes (satellites).
Data vault contents are not expected to be accessed by users. In special cases, external applications may have rights to read (or even write) directly into tables associated with a Vault - eg the Master Data Management system might be configured to replicate its data directly into the DWH.
Hubs
Each hub has exactly the following columns (no more and no less):
- a “hash key” (rather than a surrogate key)
- load date (date this entity was first seen)
- record source (system this entity was first seen in)
- a single “business key” (a single-component or compound key used by source systems to identify this entity)
- an optional “last seen” timestamp
The hash-key is a fixed-length string, generated via an algorithm such as MD5 or SHA-1. It is a hash of the “business key”.
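To make this concrete, here is a minimal sketch of what such a hub might look like as a relational table. The entity (“customer”), column names and types are my own illustrative choices, not something prescribed by the book:

```sql
-- Sketch only: a hub table for a hypothetical "customer" entity.
-- Column names and types are illustrative; conventions vary between projects.
CREATE TABLE hub_customer (
    customer_hkey    CHAR(32)     NOT NULL,  -- hash of the business key (here MD5 as hex)
    load_date        TIMESTAMP    NOT NULL,  -- when this business key was first seen
    record_source    VARCHAR(100) NOT NULL,  -- system the business key was first seen in
    customer_number  VARCHAR(50)  NOT NULL,  -- the business key itself
    last_seen_date   TIMESTAMP,              -- optional; helps detect upstream deletes
    PRIMARY KEY (customer_hkey)
);
```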
Sometimes the same entity has different “business keys” in different source systems. In this case, a hub entry is made for each different business key, then one of the entries is selected as the “master” and a “Same As” link entry is created to map each “secondary” hub entry to the “master” one. If the business keys “overlap” (eg are sequential integers) then the two hub entries need to be in different hub tables, with the link pointing between the tables. Otherwise, it is simpler to put both hub entries in the same hub table - with a different value for “record source”. Of course the best solution of all is to update the source systems to use consistent keys for the same entities.
The “last seen” field can be used to create deltas from full dumps of an upstream system; when only one upstream system provides that data entity, then an update can “touch” the last_seen date of each record as it is read (even when identical). After the load, any record with an old last_seen value has been “deleted”.
Because a hub has no “business attributes” at all, except for the business key, it does not need versioning, and is almost immune to schema changes in upstream systems. This stability is really useful - a hub record acts as a point around which changing data associated with a specific entity is organised without itself ever being updated. The only time a hub entry is deleted is when that entity is completely purged from the data warehouse. All attributes for a hub are held by associated satellite records - which can have any schema necessary to represent upstream data.
Links
Links typically model:
- transactions (events)
- associations
- hierarchies
- redefinitions
Links do not have time-stamps - other than a “load date” for auditing purposes. They represent a relation that was true at some time - regardless of whether that relation is still valid. Satellites attached to the link may include a start-date indicating when that link was active (with the start-date of other satellite entities providing an implicit end-date).
There are two kinds of links:
- Historized aka normal aka versioned
- Nonhistorized aka transactional (I think “immutable” would be a better name)
Historized links represent things whose attributes might vary over time; this doesn’t mean that a link record is ever updated but instead that a new record might be inserted into the database which represents “more recent data”, making the existing record “of historical value only”. Associations and hierarchies are usually of this type.
A historized (aka normal or versioned) link has exactly the following columns (no more and no less):
- a “hash key”
- load-date
- record source
- hash-keys for two or more hub entries
- zero or more “dependent child keys” aka “degenerate fields” (special-case optimisation)
The hash-key is a hash of the business keys of all hub entities that the link references, plus all dependent child keys. A normal (mutable) link then has one or more satellites attached to it, showing the attributes valid from a specific timestamp. When these attributes change, another satellite is added with a new timestamp. Note that the original attributes are not modified; the link is “logically mutable over time” but the older data is never lost.
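Continuing the illustrative customer example, a historized link between a hypothetical customer hub and a hypothetical product hub might be sketched as follows (again, the names are my own):

```sql
-- Sketch only: a historized link relating hypothetical customer and product hubs.
-- This example has zero "dependent child keys".
CREATE TABLE link_customer_product (
    customer_product_hkey  CHAR(32)     NOT NULL,  -- hash of all referenced business keys
    load_date              TIMESTAMP    NOT NULL,
    record_source          VARCHAR(100) NOT NULL,
    customer_hkey          CHAR(32)     NOT NULL,  -- hash key of the referenced customer hub entry
    product_hkey           CHAR(32)     NOT NULL,  -- hash key of the referenced product hub entry
    PRIMARY KEY (customer_product_hkey)
);
```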
A nonhistorized (transactional aka immutable) link has attributes that never change over time. Events (eg financial transactions) are often of this type - an event binds various entities and cannot “unhappen”. The recommended way to model this is a (link, satellite) pair similar to the above, but where the link also has a “transaction id” field that uniquely identifies the actual event/transaction (and is part of the hash-key). Alternatively, the satellite attributes can be added directly to the link table - but this makes such tables different in structure from other links, complicating ETL somewhat.
A link table is capable of modelling many-to-many relations.
The 3rd-normal-form approach of representing a many-to-one relation is to embed the foreign key of the “one” as a column on the “many”. However if the relation becomes many-to-many in the future, this is a significant change to the data model. Links represent any relation (regardless of multiplicity) with an approach similar to the one used in 3rd-normal-form for representing many-to-many relations (an external table for the relation) and thus avoid the need for such changes. In addition, a one-to-one relation in normal form requires the reference to be placed at one end, ie the relation is implicitly directional. An external link table avoids the need to choose which “end” of the relation is more important. Of course, the normal-form approach ensures that what is intended to be a 1:1 or 1:N relation is not incorrectly transformed into an inappropriate N:N - but such concerns are not important in a data warehouse where data is added in a very controlled way.
Like hubs, links have no “business attributes” at all - just references to hubs (plus optional dependent child keys). They do not need versioning, and are almost immune to schema changes in upstream systems. The only time a link entry is deleted is when a referenced entity is completely purged from the data warehouse. All attributes for a link are held by associated satellite records - which can have any schema necessary to represent upstream data.
The “dependent child keys” are a special case due to the fact that links often represent “events” (similar to fact tables in dimensional modelling). Because events (facts) can be extremely frequent, and are naturally immutable, it makes sense to push the event properties into the link record itself rather than store them in a separate satellite - though (unlike with satellites) this is problematic if the schema of those events changes.
Satellites
A Satellite provides attributes for a hub or link. A satellite may have time-stamps defining the period over which those attributes apply.
The primary key of a satellite is a compound value:
- the hash-key of the entity it references (only ever one)
- and a “load-version”
The “load-version” is usually the timestamp at which the ETL run started; as no single ETL run should create multiple satellite records of the same type for the same entity, the (key, date) pair should be unique. However when a single ETL batch contains multiple “versions” of the same record (eg as happens when loading historical data) then an appropriate date should be chosen. And when the satellite is being populated “in real time”, eg via a change-data-capture feed, then the “load-version” should be (or include) an incrementing sequence number.
A satellite entry only references one hub or link, and no other table ever references a satellite. Satellite tables are never hierarchical, and never shared between entities.
A satellite record has as many additional attributes as needed to hold properties provided by the source. Optionally, an additional “hash difference” field may be stored which is a hash of all other fields in the satellite, in order to help with “delta detection” for data during ETL.
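As an illustration, a satellite holding customer attributes delivered by a hypothetical CRM system might look roughly like this; the payload columns simply mirror whatever that (invented) source provides:

```sql
-- Sketch only: a satellite for the hypothetical customer hub, fed by one source (a CRM system).
CREATE TABLE sat_customer_crm (
    customer_hkey  CHAR(32)     NOT NULL,  -- hash key of the parent hub entry
    load_date      TIMESTAMP    NOT NULL,  -- the "load-version"
    record_source  VARCHAR(100) NOT NULL,
    hash_diff      CHAR(32),               -- optional hash over the payload columns (delta detection)
    first_name     VARCHAR(100),           -- payload columns: whatever the CRM system delivers
    last_name      VARCHAR(100),
    email          VARCHAR(200),
    PRIMARY KEY (customer_hkey, load_date)
);
```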
When an update for a satellite record is received from upstream (ie new attributes for an entity), then a new satellite record is inserted - existing records are never deleted or updated. The new record contains a complete copy of all the relevant attributes, ie makes the old record “obsolete”. When a satellite associated with some source system has many fields, and some of those fields change often, then it may be appropriate to define multiple satellite tables to reduce churn in the non-changing properties.
Unlike dimensional modelling, it is acceptable to have nulls (missing data) or other “invalid values” in satellite columns; in fact the data should be “as raw as possible”. Nulls and invalid values are acceptable as DataVault is an intermediate format, not one intended for actual reporting. Data quality control/validation is primarily done when exporting to data marts rather than importing into the raw vault.
If an upstream system alters its data format such that it no longer fits into the existing satellite table(s), then new satellite tables should be defined to hold the new data. The older tables should be left as they are. The ETL export to data marts will need to be adapted to deal with the change in format.
Every hub and every link entity is expected to have at least one satellite entity referencing it.
Hash Keys
In Data Vault 2.0, each hub, link, and satellite record has a primary key which is created via hash(business-key). This is quite different from the usual approach of using an incrementing integer as a surrogate key.
The use of hash-keys instead of surrogate keys means that federated systems are easier to build. Two disconnected databases which have the same business keys will derive the same hash-key independently.
This use of hashes is somewhat similar to how the Git distributed version control system works. SHA-1 produces a 20-byte (160-bit) output; for any single pair of business keys the chance of a collision is about 1 in 2^160, and although the “birthday paradox” means collisions across a whole database become likely far sooner than that, the numbers are still astronomical: somewhere around 10^24 records would be needed in the database to reach a 50% chance of a collision. Assuming one million (10^6) records loaded per second, that threshold would only be reached after roughly 10^10 years - tens of billions of years. In short, collision is not a problem - it is more likely that the data-center gets hit by a meteorite. There are known attacks that could be used to deliberately construct two different business keys that hash (using SHA-1) to the same value - but these still require extreme amounts of processing power, lots of time, and the ability to then somehow inject those specific keys into the databases being loaded. Other hashes such as SHA-256 could be used if you are really worried about that.
Having primary keys which are fixed-length fields (SHA-1 generates 20 bytes) means that DBs can easily embed the data inline in a record; variable-length fields are trickier to deal with. And when the business-key is a compound of multiple fields, a single fixed-length hash can save both time and space compared to carrying the full compound key around. The even distribution of hash outputs also means that hash-based partitioning of data results in a nice even spread of data across partitions. When concatenating fields before hashing, always use a field-separator - otherwise different combinations of field values could produce the same concatenated string (and thus the same hash).
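As a rough sketch of how such keys might be computed while staging data, continuing the illustrative customer/product example (the staging table, its columns and the choice of MD5 are my own assumptions; string and hash function names vary per database, and NULL handling is ignored here for brevity):

```sql
-- Sketch only: computing hash keys during staging for the illustrative customer/product example.
-- MD5, UPPER, TRIM and the || concatenation operator exist in most relational databases,
-- but exact names and behaviour vary per product.
SELECT
    MD5(UPPER(TRIM(customer_number))) AS customer_hkey,
    MD5(UPPER(TRIM(product_number)))  AS product_hkey,
    -- the link hash key covers all referenced business keys, joined with a field separator
    MD5(UPPER(TRIM(customer_number)) || '|' || UPPER(TRIM(product_number))) AS customer_product_hkey,
    s.*
FROM stage_customer_product s;
```

Because the computation is deterministic, two systems staging the same business keys independently will derive the same hash keys - which is the point of the approach.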
Handling Changing Data and Schemas
Data Vault represents changing data quite differently from the “SCD” approaches used in dimensional modelling. Hubs, links and satellites never change - instead a change is represented by a new satellite entry. Satellite entries have a start-date only; an “end date” is implied by the start-date of a newer satellite entry linked to the same hub/link.
Hub, Link and Satellite tables are never updated, except:
- when setting “last seen” on a hub or link entity to help with detecting records deleted upstream
- when setting an “end date” on a satellite entity to help with performance of export to data marts
A link referencing two hubs represents data “per x and y”. Sometimes upstream systems increase the grain of measurements, providing data “per x, y and z”. This is dealt with in Data Vault by defining a new link table with references to the new set of hubs. The old link table, with its associated data, can now be considered as a “rollup”. When exporting to a data mart, some tricky decisions need to be made about how to deal with this but at least at the raw vault level, the “real truth” of the situation is clearly preserved.
Enhancing a link in this way is equivalent to adding a new dimension to a fact table in dimensional modelling - it increases the granularity of the data. However dealing with the situation when using a dimensional model as the “master data store” is more complex.
Hierarchical relations between entities of the same type can be represented simply as links with “parent” and “child” references. This is simpler and more natural than dimensional modelling’s “bridge tables”. Relations modelled in the DataVault way cannot be easily queried from BI tools but that is not the role of Data Vault storage.
Naming Conventions
Often “hub” tables are named with one of the following forms:
- H_{Name}
- HUB_{Name}
- Hub{Name}
- {Name}Hub
Usually, link and satellite tables follow similar conventions.
Data Vault Compared to Dimensional
Although Data Vault is not a dimensional model, I think it may still be useful to think about which tables have “bounded size” (dimensions) vs “unbounded size” (facts).
Hubs always represent “dimensions”. Links also represent “dimensions”, except for “nonhistorized links” which are roughly equivalent to fact-tables.
Exporting to Data Marts
End users are not expected to access vault-format stores directly; instead after data has been imported into the raw vault, it is then exported into data marts (possibly with the help of the Business Vault).
Implementing this export is non-trivial, but well documented in the Data Vault book. It may take the form of code to be executed, or simply be a series of (moderately complex) SQL statements performing “insert into .. select ...” to copy data from the raw vault and business vault into a target data mart.
There are a couple of techniques recommended by the Data Vault book to make this process easier and more efficient.
Creating a “type 1” dimension table (one without history) in a data-mart is reasonably easy:
- truncate existing table
- use something like “insert into .. select .. from SomeHub h left join SomeSatellite s on h.hashkey = s.hashkey where s.ValidTo is null;”
The select chooses just the most recent satellite for each hub. In practice, data might be drawn from multiple hubs. The joins are only on the hash-key, and filtering only on the valid-to, so this could also be feasible in a NoSQL database that has indexing only on the primary key.
A “MERGE” SQL statement might be an alternative when the DB supports it.
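Spelled out a little more fully for the illustrative customer tables sketched earlier, such a load might look roughly like this (dim_customer is a hypothetical mart table, and I assume the satellite has been given an end-date column, load_end_date here, serving the same role as ValidTo in the snippet above):

```sql
-- Sketch only: rebuilding a "type 1" customer dimension from the illustrative
-- hub and satellite, assuming the satellite carries a load_end_date column
-- that is NULL for the currently-active row.
TRUNCATE TABLE dim_customer;

INSERT INTO dim_customer (customer_key, customer_number, first_name, last_name, email)
SELECT
    h.customer_hkey,
    h.customer_number,
    s.first_name,
    s.last_name,
    s.email
FROM hub_customer h
LEFT JOIN sat_customer_crm s
       ON  s.customer_hkey = h.customer_hkey
       AND s.load_end_date IS NULL;    -- only the most recent satellite row per hub entry
```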
Creating a “type 2” dimension table (one with history) is somewhat more complex. The book example uses two subselects, a union, 3 joins and a where-not-exists. The critical fields again appear to be the hash-keys and the load-dates, so possibly still feasible without indexes other than the primary key.
To optimise export to data marts, a PIT (Point In Time) table can be generated for any link or hub table where multiple satellites exist. For each hub/link entity (ie hash key), and for each “date of interest”, a single row in the PIT exists which contains the keys of all satellite records “active at that date”. The key of a satellite record is a compound of (parent-hash-key, load-version) - ie the PIT table has 2 columns for each satellite table. The hash key is of course the same for each referenced satellite record (they all reference the same parent) - except when there is “no such record”, in which case the (hash, date) pair should point to a “ghost record” (dummy record) with default values. Each satellite table should have a single “ghost record”; its hash is typically set to all-zeros and its load-version set to the lowest possible value. Such tables are part of the “business vault”; they are derived from the real data and can be regenerated at any time.
PIT tables are also useful in the ETL phase.
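For the illustrative customer hub, imagined here with two satellites (CRM and billing), a PIT table might be sketched as follows; it is derived data and can be rebuilt at any time:

```sql
-- Sketch only: a point-in-time (PIT) table for the illustrative customer hub,
-- assuming two satellites (CRM and billing). One row per (hub key, snapshot date),
-- pointing at the satellite rows active on that date; an all-zero hash key points
-- at the "ghost record" when a satellite has no data for that date.
CREATE TABLE pit_customer (
    customer_hkey          CHAR(32)  NOT NULL,
    snapshot_date          DATE      NOT NULL,
    sat_crm_hkey           CHAR(32)  NOT NULL,
    sat_crm_load_date      TIMESTAMP NOT NULL,
    sat_billing_hkey       CHAR(32)  NOT NULL,
    sat_billing_load_date  TIMESTAMP NOT NULL,
    PRIMARY KEY (customer_hkey, snapshot_date)
);
```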
Similarly to PIT tables, a “bridge table” can be generated to summarize links. These are often implemented as “materialized views” of complex SQL statements.
Virtual Data Marts
Sometimes it is possible to provide a specific data mart (subset of the data vault for a specific set of users or use-cases) simply via database views (a “virtual data mart”). Of course this assumes that all the data needed in a specific data mart is within a single database in raw vault form; the data mart is then presented within that same database instance as a separate schema containing views rather than concrete tables.
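As a sketch (the schema and object names are invented), such a virtual mart could expose the “type 1” dimension from the earlier example as a view rather than as a copied table:

```sql
-- Sketch only: a "virtual data mart" object - a dimension exposed as a view
-- over the raw vault instead of a physically copied table.
CREATE VIEW mart_sales.dim_customer AS
SELECT
    h.customer_hkey AS customer_key,
    h.customer_number,
    s.first_name,
    s.last_name,
    s.email
FROM raw_vault.hub_customer h
LEFT JOIN raw_vault.sat_customer_crm s
       ON  s.customer_hkey = h.customer_hkey
       AND s.load_end_date IS NULL;
```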
A hybrid of traditional and virtual data mart is also possible; PIT and bridge tables can be materialized (ie are physically stored rather than being views executed on-demand) while all other tables are views defined in terms of the PIT and bridge tables. This provides performance that sits half-way between:
- completely materialized data marts, ie where all data has been copied (disk-space-intensive, dev-intensive, but fast to query)
- completely virtual (view-based) data marts (cpu-intensive, slow to query but fast and flexible to develop and without disk storage costs)
Materialized PIT/bridge tables are considered to be part of the Business Vault.
Metadata
The DataVault book recommends keeping metadata on data and processes within the data warehouse environment. Metadata includes:
- data structure and location
- data meaning
- data owner (data steward)
- data source system (location, system version)
- upload schedules
- exception handling rules
- business rules/parameters for data validation
- data volumes (for capacity planning)
- data quality
- process execution control and logs
Normally, there is no MetaVault (hubs and satellites) but instead a “mart” with such data. It might be a relational DB in relational normal form, or in dimensional form, or might be a “logical” source of data managed by the ETL tool rather than a raw DB.
Often, ETL code is written to be almost completely parameter-driven, where the parameters are in the Metadata store.
Staging
Staging is the component where data is accepted from upstream systems, and fed into the ETL processes that load the raw vault.
Sometimes data is imported from upstream systems directly into a relational database within the staging area. In this approach, the target database schema usually contains no foreign-key constraints between tables, so that tables can be loaded from upstream systems in any order. (Note: using hashes for references when pushing data into the raw vault also helps with out-of-order loading, as synchronized sequence number allocation between referrer and referee is not needed).
Processing of data (upload to raw vault) is usually done as something equivalent to “for each record in the primary table joined against other tables ...” - which means that indexes on tables are usually not helpful, and only slow down the initial insert into staging.
Hash keys and hash-diffs (hashes over all attributes) can be computed early.
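A staging table therefore tends to be a loosely-constrained mirror of the source extract, plus the pre-computed hash columns - roughly as in this sketch (names are my own, continuing the customer example):

```sql
-- Sketch only: a staging table mirroring a hypothetical upstream "customer" extract.
-- All payload columns are nullable, there are no foreign keys and no extra indexes;
-- the hash key and hash diff are computed while the data is staged.
CREATE TABLE stage_customer (
    load_date        TIMESTAMP,      -- when this batch was staged
    record_source    VARCHAR(100),
    customer_hkey    CHAR(32),       -- hash of the business key
    hash_diff        CHAR(32),       -- hash over all payload columns
    customer_number  VARCHAR(50),
    first_name       VARCHAR(100),
    last_name        VARCHAR(100),
    email            VARCHAR(200)
);
```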
Loading the Raw Data Vault from Staging
The Data Vault book describes two basic approaches to moving data from a relational DB in the staging area into the raw vault:
- using a sequence of SQL statements
- using code (possibly generated via a dedicated ETL tool)
The logic in either case is roughly the same.
Step 1: retrieve all business keys from the source data (staging) and ensure that the corresponding hub entities exist (create if not).
Step 2: create links similarly to hubs
Step 3: create satellite records
- try to omit obvious duplicate records in the staging area, for performance
- retrieve the latest satellite for the same referenced entity (hub or link)
- do a join from the staging data against the satellite table where load-end-date is null (ie is “current”)
- drop record if current is identical to staging (possibly use hash-diff for this)
- otherwise insert new satellite
Step 4: deactivate obsoleted satellites
- set load-end-date on satellites where new versions were inserted, in order to mark them as “not active”
- an UPDATE statement with nested select can do this, as can windowing functions
Note that computation of hash-keys has been done during loading of data into staging.
Note also that in step 3, the “drop obvious duplicates in staging” and “drop record if current” are required only if the data in staging is a “complete dump” of the upstream database. If the upstream system instead delivers a reliable “delta” only (ie only records which have changed) then these steps are not necessary.
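As a sketch of step 1, using the illustrative staging and hub tables from earlier (and assuming one record source per staging table and one load_date per batch):

```sql
-- Sketch only: step 1 - create hub entries for any business keys in staging
-- that the raw vault has not seen before.
INSERT INTO hub_customer (customer_hkey, load_date, record_source, customer_number)
SELECT DISTINCT
    s.customer_hkey,
    s.load_date,
    s.record_source,
    s.customer_number
FROM stage_customer s
WHERE s.customer_number IS NOT NULL
  AND NOT EXISTS (
      SELECT 1
      FROM hub_customer h
      WHERE h.customer_hkey = s.customer_hkey
  );
```

Steps 2 and 3 follow the same insert-if-missing pattern against the link and satellite tables.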
In general, hubs and links are not deleted. When an entity (hub entry) can become “no longer active” then a satellite table should be used to indicate whether the associated hub is active or not. The same approach is used for links. The satellite usually contains start and end dates (note that as usual, satellite entries are not modified after being created). Such a satellite is called an “effectivity satellite”.
Sometimes records do need to be deleted for privacy reasons. That is a special case, and is done in the obvious way.
Detection of deletes needs to be done with care: when the upstream system is just delivering “dumps” (snapshots) of the complete state, then there are various reasons why a particular entity might not be included in a dump other than it having been deleted upstream. These include a dump being only partial due to the upstream system running out of disk space, or a bug in the ETL process.
Deleted records can obviously be detected when a “change data capture” stream is provided. When a snapshot is provided, then either:
- find the difference between the set of keys in staging and the set of keys in the raw data vault, or
- have a “last seen” date or etl-version on each hub and link, and update this during ETL then find the keys that have not been updated. Normally, hubs and links are not updated in a data-vault, but this special timestamp is just an “ETL helper” field and not actually part of the user data. If such update is a problem, a separate table could be used to collect the updated keys.
- combine the above: anything with a recent “last-seen” date can be skipped; compare the remaining set of records with the set of keys in the incoming data.
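The key-set comparison variant can be sketched as a simple anti-join over the illustrative customer tables (only valid when staging really holds a complete snapshot of the source table):

```sql
-- Sketch only: detect customers that exist in the raw vault but are missing
-- from the current full snapshot in staging - candidates for "deleted upstream".
SELECT h.customer_hkey, h.customer_number
FROM hub_customer h
WHERE NOT EXISTS (
    SELECT 1
    FROM stage_customer s
    WHERE s.customer_hkey = h.customer_hkey
);
```

The detected keys would then typically be recorded via an “effectivity satellite” entry (see above) rather than by deleting the hub rows.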
Important: as much data as possible should be loaded into the raw data vault - data quality tests should be applied after loading, not before. The raw DV should hold “all the facts, good bad and ugly”. Of course business keys need to be populated. In addition, satellite tables have properly-typed columns eg date-typed instead of string-typed (using char-type for everything would mean extremely poor performance and complicated selects). That means that some type-conversions will not be possible and such records must be written to a “rejected/error” destination (an “error mart”) rather than being loaded. However such destinations are hard to deal with - import data into the raw data vault wherever possible. This means most/all columns should be nullable or have a special representation for incoming null values, etc., and data quality is applied at a later phase.
A note regarding processing of “dumps” vs “deltas” as inputs. Obviously it is preferable from a performance point of view when the upstream systems provide only records which have changed within a table (a “delta”), rather than a complete copy of a table. However it is important to remember the difference between dimensions and facts in dimensional modelling - ie bounded vs unbounded datasets. Transactional events (facts) almost always have an associated timestamp (the point in time at which the event occurred), and thus upstream systems can very easily provide a “delta” (all records since the last ETL run). Reference data (dimensions) does not naturally have an associated timestamp (only some systems bother to keep a last_modified timestamp on such data) and is therefore often provided as full dumps. However dimensions are bounded and thus small; inefficient processing is not so important here. Fact tables (representing events) may be very large - but will almost always be deltas.
Data Vault Primary Strengths
Now that the principles of Data Vault have been covered, it is useful to look at a couple of things it does very well in comparison to other approaches.
Handling Invalid Data
One central requirement for any DWH is that data held for reporting purposes needs to be clean. Reports run against tables that have weird values, or “non-normalised” values, will behave badly and the result is unhappy users. However data does often arrive with weird or non-normalised values - so what can be done?
In a system where the “master storage” is also the base for reporting (Kimball-style), the data must be cleaned or discarded before landing in long-term storage - and thus any original not-acceptable values are permanently lost. Of course data warehouses are just a cache for data from upstream “operational” systems. However often these upstream systems keep only a minimum of historical data, for cost and performance reasons. It may therefore not be possible to go back to them and “reimport” old data when needed.
In a system where there is an extra tier of storage, data can be stored as it was received (even if invalid), and cleaned when being exported to data marts.
The difference is particularly important when rules related to how data should be “cleaned” change over time. With the data mart as master storage, the original representation is gone and retrospective changes to data-cleanup are not possible. Auditing of the data-cleaning process is also not possible. With the data vault approach, incoming data is kept “as original as possible”; it can then be re-exported to data marts using different cleanup rules when necessary - and there is better traceability from the data-mart “clean” versions back to the original data.
A related problem is that “clean” data may mean different things to different users. Having “raw data” in the Raw Vault allows different transformation rules to be applied as the same data is exported to different data-marts for different use-cases.
Business users often state that a data warehouse should be a “single version of truth”. Data marts can provide this. However such data has been curated/cleaned/modified/selected. Data Vault’s internal storage (raw vault) instead provides a “single version of facts” - ie a wide range of corporate data in one place, raw and (relatively) uninterpreted.
Handling Multiple Sources for the Same Entity
When two upstream systems describe the same entity (at a business level), then a “hub table” is defined for that entity type (one row for each entity). The hub table has little more than a key. For each upstream system, a “satellite table” is then created which holds the properties of that entity as provided by that upstream system, plus a foreign key reference to a record in the corresponding hub table. When N upstream systems refer to the same entity, the data-vault will hold one “hub” entity, and one entry in each of N different satellite tables, which all reference the common hub entity to tie them together.
This makes the first step in ETL relatively easy - there is no need to figure out how to map different interpretations of specific attributes to a shared form. It also provides excellent traceability, as the data in the data-vault is very close to the upstream form. Some normalisation is usually applied, eg converting characters to a standard character-set, representing dates and numbers in the same form, etc. However no mappings should be done which can “fail”; this minimises the chance that data simply cannot be imported, ie that an import process must discard a record or reject a whole import batch.
Data Quality
Ideally, upstream sources would provide perfect data. Next best option is for the data warehouse to detect when upstream sources are delivering bad data, and work with the business to get the problem fixed upstream (the consumers of the poor-quality data in the warehouse are the best people to estimate the costs of this problem and put pressure on upstream to fix it; the IT department can detect problems but neither estimate the financial impact nor apply pressure at business level).
And the final option is to patch over the problem during data-warehouse ETL processes.
Sometimes the correct action is to omit a bad record; in other cases to set an “invalid” column, and in others to replace invalid values with valid ones.
The nice thing about Data Vault is that just about any data can be imported into the raw vault; cleaning up data is done during the export to a data mart instead. Where such cleanup is time-consuming, or needed by multiple data-marts (or both) the results of the cleanup can be written to the Business Vault. A satellite table containing raw records with data quality errors can be “mirrored” by a different satellite table in the Business Vault with cleaned-up data. This derived satellite table can in some cases be implemented as a database view of the original table (a “virtual satellite”).
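Such a “virtual satellite” can be little more than a view applying the cleansing rules. A sketch for the illustrative CRM customer satellite (the rules and schema names are invented):

```sql
-- Sketch only: a "virtual satellite" in the Business Vault - a view over the raw
-- satellite that applies simple, invented cleansing rules.
CREATE VIEW business_vault.sat_customer_crm_clean AS
SELECT
    customer_hkey,
    load_date,
    record_source,
    COALESCE(NULLIF(TRIM(first_name), ''), '(unknown)') AS first_name,
    COALESCE(NULLIF(TRIM(last_name),  ''), '(unknown)') AS last_name,
    LOWER(TRIM(email))                                  AS email
FROM raw_vault.sat_customer_crm;
```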
A “data quality tool” can be used to define the rules that valid records must comply with, and what action to take when the rule is broken. The tool may then generate necessary code to derive a “cleaned” table from the original.
Master Data Management (MDM)
In order to get consistent reporting across multiple departments within a business, it is necessary to standardize the data. Costs and profits cannot be compared when gathered in different time-periods, using different tax rules, and using different product codes. The usual solution to this is to take each subset of reference data that should be standardized (whether time-periods, tax codes, or product lists), assign a single group to be responsible for that reference data, and have their “master” version of the reference data be replicated to each department rather than having the departments maintain their own versions of this reference data. This process is called Master Data Management.
Master data management is not really part of data warehousing - but a data warehouse is less useful when a company does not have good Master Data Management practices established. It is not particularly useful for a DWH to gather corporate-wide data in one place if the inconsistencies between departments make it hard to compare the values anyway. It is therefore common for Master Data Management processes to be established at the same time as a DWH is built.
WARNING: Kimball points out that Master Data Management is inherently a very political process - it requires changing the way that various departments gather/represent data. It therefore needs to be driven by management that have the authority to make the upstream providers of DWH data actually accept and integrate the new master data feeds into their systems. The DWH architects and developers themselves are unlikely to have such business-level authority and will just be ignored.
Master Data Management usually has change-tracking controls to provide an audit trail of when master data was changed and by whom. Some MDM systems have a kind of simplified entity-relationship-modelling tool built in so that non-technical users can define data structures and relations to hold master data without knowing SQL. These diagrams are also useful for browsing the available data; MDM also acts as a kind of “data catalog” for enterprise-wide standard tables of data. A specific entity may be owned by a “data steward” who is responsible for choosing the “official master data records”, eg lists of products or partner-companies.
None of the above has anything specifically to do with Data Vault; it is a common problem for all data warehouses. However there is one thing that a DWH can do to help. Creating a data mart in normal form may be useful as a first step towards master data management: first gather the data available in multiple systems, and make the union of this data available in normal form. Then dedup the data and store the result as “the master version” (such processing may take a lot of manual work). Then resubmit the master version back to the upstream systems as the official, deduped version.
DataVault on a big-data solution
The fact that hubs, links, and satellites have no logical need to change data once written is very friendly for many big-data systems (eg Hive, BigQuery). DV2.0 does suggest that mutable “last seen” dates for hubs and links can be useful for detecting records deleted upstream - but maybe those could be stored in something like HBase or a simple key/value datastore instead. It also suggests that “end dates” on satellite records are useful for export to data-marts; again there may be an external solution for that. These date-values are not part of the data model, just optimisations for import and export processes, which can be implemented via other mechanisms if needed.
The primary problem when using big-data solutions (whether hive-like or hbase-like) is that DV2.0 does lots of lookups by key, while big-data solutions typically have very limited indexing ability (partitioning and clustering for hive, single rowkey for hbase).
I am not yet sure how easy it would be to get Data Vault running on a non-relational storage layer; please let me know if you have good references on this topic!
Quotes
Here are some interesting quotes from the Data Vault book:
A key difference of the Data Vault model, as compared to other modeling techniques, is that it allows bad data into the (Raw) Data Vault and applies business rules after loading the Data Vault. eg “same-as links” for deduplicating data, and “data quality services” such as replacing nulls with default values or clipping/discarding extreme values.
The goal of Data Vault 2.0 loads is to cover all data, regardless of its quality (“the good, the bad, and the ugly”).
Business users expect that the data warehouse will present correct information of high quality. However, data quality is a subjective concept. Data that is correct for one business user might be wrong for another with different requirements or another understanding of the business view. There is no “golden copy” or “single version of the truth” in data warehousing. Just consider the business user who wants to compare reports from the data warehouse with the reports from an operational system, including all the calculations, which might differ. In many cases, data warehouse projects divert from the calculations and aggregations in operational systems due to enterprise alignment or to overcome errors in the source system. This is not a desired solution, but is reality grounded in the fact that the data warehouse often fixes issues that should have been fixed in the source system or the business processes. In any case, the data warehouse should provide both “versions of the truth.” That’s why Data Vault 2.0 focuses on the “single version of the facts.”
Related Products
Attunity provides “model driven data warehousing”: you define models, and Attunity generates the ETL code. It supports a central data warehouse in 3NF or Data Vault layout.