The Snowflake Data Warehouse

Storage Space Efficiency in Avro and HBase

I recently had a customer who suggested (for various reasons) storing large amounts of write-once data in HBase, using an (implicit) schema with long and complicated column names. Given the data volumes involved, I had immediate concerns about efficient use of disk storage with this approach. Various sites warn about long column names in HBase, but I could not find any actual statistics on the matter.

A colleague and I therefore measured the efficiency of HBase with various column name lengths, and compared it to Avro.
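
To make the difference concrete: HBase repeats the row key, column family and column qualifier inside every stored cell, whereas Avro keeps field names only in the schema and encodes just the values. Here is a minimal sketch of the effect (my own illustrative names and values, not our actual benchmark code), assuming the standard HBase client and Avro libraries are on the classpath:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    public class ColumnNameOverhead {
        public static void main(String[] args) throws IOException {
            // HBase stores the row key, column family and column qualifier
            // inside every single cell, so a long qualifier is paid for on
            // every stored value.
            KeyValue shortName = new KeyValue(
                Bytes.toBytes("row-00000001"), Bytes.toBytes("d"),
                Bytes.toBytes("ts"), Bytes.toBytes("2018-07-01T00:00:00Z"));
            KeyValue longName = new KeyValue(
                Bytes.toBytes("row-00000001"), Bytes.toBytes("d"),
                Bytes.toBytes("record_creation_timestamp_utc_iso8601"),
                Bytes.toBytes("2018-07-01T00:00:00Z"));
            System.out.println("short qualifier cell: " + shortName.getLength() + " bytes");
            System.out.println("long  qualifier cell: " + longName.getLength() + " bytes");

            // Avro, by contrast, keeps field names only in the schema; the
            // encoded record contains just the values, so field-name length
            // does not affect the per-record size at all.
            Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("record_creation_timestamp_utc_iso8601")
                .endRecord();
            GenericData.Record record = new GenericData.Record(schema);
            record.put("record_creation_timestamp_utc_iso8601", "2018-07-01T00:00:00Z");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericData.Record>(schema).write(record, encoder);
            encoder.flush();
            System.out.println("avro record: " + out.size() + " bytes");
        }
    }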

Threat Modelling with STRIDE

A Lambda Architecture with Spark Streaming from Walmart Labs

Walmart Labs have posted an interesting article about analysing a clickstream with a lambda architecture, using Spark Streaming and Spark batch jobs.

I was part of a project that tried to do streaming processing with Spark a year or so ago. That didn’t go at all well; we had limited resources and time, and (IMO) Spark Streaming was simply not mature enough for production.

One of the nasty problems we had was that landing data into Hive created large numbers of small files. They solve that by using KairosDB as the target storage instead; KairosDB is a time-series database layered on Cassandra, ie HBase-like.
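
To show why the small files appear: each Spark Streaming micro-batch writes its own output, one file per RDD partition, so a short batch interval alone can generate hundreds of thousands of files per day. A minimal sketch (the path, port and batch interval are invented for illustration):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SmallFilesSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("clickstream-landing");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            JavaDStream<String> clicks = ssc.socketTextStream("localhost", 9999);
            // One output directory per 5-second batch (17,280 per day), with
            // one file per RDD partition inside it. coalesce() reduces the
            // per-batch file count but cannot remove the per-batch
            // directories themselves.
            clicks.foreachRDD((rdd, time) ->
                rdd.coalesce(1).saveAsTextFile("/warehouse/clicks/batch-" + time.milliseconds()));

            ssc.start();
            ssc.awaitTermination();
        }
    }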

Another serious problem with Spark Streaming is session detection; it is possible, but only with significant complexity. If I understand correctly, they solve that via the lambda architecture: rough session detection in the streaming layer, and more accurate detection in the batch pass.
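
For a feel of the complexity involved: “rough” session detection in Spark Streaming means hand-rolling the state management, for example with updateStateByKey. The sketch below is my own illustration (the 30-minute gap, the event layout and the state class are invented, not from the Walmart article):

    import java.io.Serializable;
    import java.util.List;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    public class RoughSessionizer {
        static final long SESSION_GAP_MS = 30 * 60 * 1000;

        // Per-user state carried from batch to batch.
        public static class SessionState implements Serializable {
            long lastEventTime = Long.MIN_VALUE;
            long sessionCount = 0;
        }

        // events: (userId, eventTimeMillis). Requires ssc.checkpoint(...)
        // to be configured, since updateStateByKey is stateful.
        static JavaPairDStream<String, SessionState> sessionize(
                JavaPairDStream<String, Long> events) {
            return events.updateStateByKey(
                (List<Long> times, Optional<SessionState> prev) -> {
                    SessionState s = prev.isPresent() ? prev.get() : new SessionState();
                    for (long t : times) {
                        // Start a new session when the gap since the last
                        // event is too large. This is "rough": it ignores
                        // late and out-of-order events (the batch pass can
                        // repair those), and state never expires here, so
                        // real code must also prune idle users.
                        if (t - s.lastEventTime > SESSION_GAP_MS) s.sessionCount++;
                        s.lastEventTime = Math.max(s.lastEventTime, t);
                    }
                    return Optional.of(s);
                });
        }
    }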

They still apparently had to fiddle with lots of Spark Streaming parameters though (batch duration, memory.fraction, locality.wait, executor/core ratios), and write custom monitoring code. And they were running on a dedicated Spark cluster, not YARN. My conclusion from this is: yes, Spark Streaming can work for production use-cases, but it is hard.
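
For reference, this is the kind of knob-twiddling meant; all values below are placeholders of my own, not anything Walmart Labs published:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class TuningSketch {
        public static void main(String[] args) {
            // Every value here is workload-dependent and has to be found
            // empirically, which is exactly the fiddling described above.
            SparkConf conf = new SparkConf()
                .setAppName("clickstream")
                .set("spark.memory.fraction", "0.6")  // execution/storage memory split
                .set("spark.locality.wait", "100ms")  // wait before giving up data-locality
                .set("spark.executor.cores", "4");    // the executor/core ratio
            // The batch duration is the remaining knob, set on the context itself.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        }
    }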

After my experiences, and some confirmation from this article, a solution based on Flink, Kafka Streams, or maybe Apache Beam seems simpler to me. Those are all robust enough to process data fully in streaming mode, ie the kappa architecture.
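
For comparison with the session-detection pain above: in Kafka Streams, sessionization is a built-in windowing primitive. A minimal sketch, assuming a “clicks” topic keyed by user id (topic name and gap are my invention, and SessionWindows.with(Duration) needs a reasonably recent Kafka client):

    import java.time.Duration;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.SessionWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    public class SessionCounts {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Events-per-session per user; the library handles session
            // boundaries (30-minute inactivity gap), out-of-order events,
            // and merging of overlapping sessions.
            KTable<Windowed<String>, Long> sessions = builder
                .<String, String>stream("clicks")
                .groupByKey()
                .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
                .count();
        }
    }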

While talking about Spark, here is an unrelated but interesting article on Spark for Data Science: the Good, Bad and Ugly.

The Graal Virtual Machine

Oracle are well known for the Java Virtual Machine project (inherited from Sun). They have now released version 1.0 of a general-purpose virtual machine called Graal that supports:

  • Java bytecode (production) - includes Java, Scala, Groovy, Kotlin
  • Javascript (production) - including Node.js applications
  • LLVM bitcode, ie apps compiled from C, C++, Rust and other languages via the LLVM compiler (experimental)
  • Python, Ruby, and R (experimental)

Code in these languages can call into other code running within Graal, regardless of the language it was written in! Arranging for additional libraries (including the language standard libraries) to be available requires some steps, but is possible.
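
For a flavour of what the cross-language calls look like, here is a minimal sketch using the polyglot API that GraalVM ships (the snippet is my own, and must be run on the GraalVM JDK):

    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.Value;

    public class PolyglotDemo {
        public static void main(String[] args) {
            // A polyglot Context hosts guest languages inside the JVM.
            try (Context context = Context.create()) {
                // Evaluate a Javascript function, then call it from Java.
                Value doubler = context.eval("js", "(n => n * 2)");
                System.out.println(doubler.execute(21).asInt()); // prints 42
            }
        }
    }

An application embedding the VM (such as the database-server case below) would hold on to such a Context and pass user code to eval().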

Not only does this allow running apps in a “standalone” environment, it means that any larger software package which embeds the Graal VM and allows user code to run in that VM can support any language that Graal supports. Examples include database servers which embed the VM for stored procedure logic.

With Oracle, it is important to look at the licensing terms-and-conditions. These do initially seem to be OK; the code is completely licensed under GPLv2 with the Classpath Exception, like OpenJDK. Oracle does warn that there is “no support” for the open-source code (aka the “community edition”) and recommends that a support licence be bought for the “enterprise edition” instead - but OpenJDK is reliable enough, and so the Graal “community edition” will hopefully be so too.

The Graal project website has more information.

The New Oracle Java Release Cycle

Oracle have changed the way they release new versions of the Java Development Kit (JDK) and the Oracle Java Virtual Machine.

Google Cloud Functions, BigQuery, and Related Matters

Apache Beam and Google Dataflow Overview