The Data Engineering Weekly newsletter is well worth reading. A recent article mentioned the Snowflake database (data warehouse), which I had not previously heard about; if you also work with data warehouses then perhaps my research notes may be useful.
I recently had a customer who suggested (for various reasons) storing large amounts of write-once data in HBase, using an (implicit) schema with long and complicated column names. I had immediate concerns about efficient use of disk storage with this approach (these were quite large amounts of data). Various sites warn about long column-names with HBase, but I could not find any actual statistics on it.
A colleague and I therefore measured the efficiency of HBase with various column name lengths, and compared it to Avro.
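As a rough illustration of why column-name length matters: HBase stores every cell as a separate KeyValue that repeats the row key, column family and column qualifier (the column name) alongside the value, whereas an Avro container file stores the field names once in the schema header. The sketch below estimates the raw size of one cell from the classic KeyValue layout. It is a back-of-envelope model only, not the measurement described above: it ignores tags, MVCC data, compression and data block encoding, all of which change the real numbers, and the function names are made up for this example.

```python
# Back-of-envelope estimate of the raw size of one HBase cell (KeyValue).
# Assumption: classic KeyValue layout, no tags, no compression, no block encoding.

# Fixed per-cell fields, in bytes:
#   key length (4) + value length (4) + row length (2)
#   + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate bytes used by one cell before compression/encoding."""
    return FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)


def payload_fraction(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> float:
    """Fraction of the cell that is actual value data (hypothetical helper)."""
    return len(value) / keyvalue_size(row_key, family, qualifier, value)


if __name__ == "__main__":
    row = b"r" * 16      # hypothetical 16-byte row key
    val = b"v" * 8       # hypothetical 8-byte value, e.g. a long
    for name in (b"t", b"customer_billing_address_postal_code"):
        size = keyvalue_size(row, b"d", name, val)
        frac = payload_fraction(row, b"d", name, val)
        print(f"qualifier length {len(name):2d}: {size} bytes/cell, {frac:.0%} payload")
```

Because the qualifier is repeated in every cell, its length is paid once per stored value rather than once per table, which is where the concern about long column names comes from.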
I’ve been doing some thinking and reading about software security recently, and in particular using threat modelling to find security problems with IT systems. I have written up some thoughts on threat modelling here.
I was part of a project that tried to do streaming processing with Spark a year or so ago. That didn’t go at all well; we had few resources and little time, and (IMO) Spark-streaming was simply not mature enough for production.
One of the nasty problems we had was that landing data into Hive created large numbers of small files; the authors of the article solve that by using KairosDB as the target storage instead. KairosDB is a layer on Cassandra, ie HBase-like.
Another serious problem with Spark-streaming is session-detection; it is possible but only with significant complexity. If I understand correctly, they solve that via the lambda architecture: rough session detection in streaming, and better detection in the batch pass.
They still apparently had to fiddle with lots of Spark-streaming parameters though (batch duration, memory.fraction, locality.wait, executor/core ratios), and write custom monitoring code. And they were running on a dedicated Spark cluster, not YARN. My conclusion from this is: yes, Spark-streaming can work for production use-cases, but it is hard.
After my experiences, and some confirmation from this article, a solution based on Flink, Kafka Streams, or maybe Apache Beam seems simpler to me. Those are all robust enough to process data fully in streaming mode, ie the kappa architecture.
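To make the session-detection problem mentioned above a little more concrete, here is a minimal sketch of the easy (batch) case, assuming a session is simply a run of events from one user with no inactivity gap longer than some threshold. That definition, the function name and the parameter are my own assumptions for illustration, not taken from the article. In a batch pass all events are available and can be sorted first; the difficulty in streaming is that late or out-of-order events can merge or extend sessions that were already emitted.

```python
from typing import Iterable, List


def split_sessions(timestamps: Iterable[float], max_gap: float) -> List[List[float]]:
    """Group one user's event timestamps into sessions.

    Assumption: a new session starts whenever the gap to the previous
    event exceeds max_gap (same time unit as the timestamps).
    """
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):  # batch mode: we can sort everything up front
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)   # close enough: extend the current session
        else:
            sessions.append([ts])     # gap too large (or first event): new session
    return sessions


if __name__ == "__main__":
    # Events arrive out of order; a 30-second gap separates sessions.
    events = [105.0, 0.0, 100.0, 10.0, 25.0]
    for session in split_sessions(events, max_gap=30.0):
        print(session)
```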
While talking about Spark, here is an unrelated but interesting article on Spark for Data Science: the Good, Bad and Ugly.
Oracle are well known for the Java Virtual Machine project (inherited from Sun). They have now released version 1.0 of a general-purpose virtual machine called Graal that supports:
- Java bytecode (production) - includes Java, Scala, Groovy, Kotlin
- LLVM bitcode, ie apps compiled from C, C++, Rust and other languages via the LLVM compiler (experimental)
- Python, Ruby, and R (experimental)
Code in these languages can call into other code running within Graal, regardless of the language it was written in! Arranging for additional libraries (including the language standard libraries) to be available requires some steps, but is possible.
Not only does this allow running apps in a “standalone” environment, it means that any larger software package which embeds the Graal VM and allows user code to run in that VM can support any language that Graal supports. Examples include database servers which embed the VM for stored procedure logic.
With Oracle, it is important to look at the licensing terms-and-conditions. These do initially seem to be OK; the code is completely licensed under the GPL2-with-classpath-exception, like OpenJDK. Oracle does warn that there is “no support” for the open-source code (aka “community edition”) and recommends that a support licence be bought for the “enterprise edition” instead - but OpenJDK is reliable enough, and so the Graal “community edition” will hopefully be so too.
The Graal project website has more information.
Oracle have changed the way they release new versions of the Java JDK and the Oracle Java Virtual Machine.
I have recently added some new articles related to the Google cloud:
- Google Cloud Storage Overview
- Google Cloud Functions Overview
- Google Databases Overview
- Google BigQuery Overview
- SQL Analytic Functions on BigQuery
- Dealing with Mutable Dimension Tables in BigQuery
- Beam/Dataflow: how to send a Pubsub message after processing completed
As usual, these are notes-to-myself. Use at your own risk! Feedback very welcome.