Data Vault 2 on Big Data Platforms

Categories: Programming, BigData

WARNING: Article In Progress!

Actually, it is fairer to describe the current state of this page as just a bunch of notes and links, that hopefully will some day be a coherent article about data vault on big data platforms such as Hive.

If you do have some information on DV2 and big data, please share!

Overview

The Data Vault 2 data warehousing pattern (as documented in Dan Linstedt’s book) assumes a relational database as underlying data storage. However relational systems have limits to scaleability - and become increasingly expensive when approaching those limits. Big-data platforms such as Apache Hive or Google BigQuery are tempting as the underlying storage for a data warehouse, due to their scaleability and price, but they have limitations that apparently conflict with the requirements for implementing Data Vault 2. In particular, Hive and BigQuery do not like performing updates on existing data, and do not support table indexes.

A blog-post from Dan Linstedt in 2015 describes the problems of data-vault on big data platforms.

The Problem In Detail

TODO: summarize the list of problems presented in Dan’s post above, and look specifically at some of the select/insert statements in the DV2 book which require either modifying data, or having indexes.

DV2 as an insert-only system

A post from Dan in 2016 suggests that in-place mutations of tables in Data Vault are not necessary, which would help greatly with big-data platforms. Apparently, this relies on building point-in-time tables and bridge-tables, though sadly the details are somewhat lacking. Dan does state that his course includes information on “Data Vault on Hive”.

TODO research how exactly point-in-time and bridge tables should be used to make DV2 an “insert only” system..

Alternatives

I stumbled across a presentation on anchor modelling vs data-vault - WTF is “anchor modelling”?

TODO

See DV2 with Airflow.