Yubikey, FIDO2 and Backups

As I wrote in my recent look at the Yubikey, it seemed to me that the rather primitive approach to backups taken by the Yubikey so far would not be sufficient for the FIDO2/WebAuthn world, where the number of distinct credentials will be far larger.

Now that the Yubikey-5 is out, with support for FIDO2/WebAuthn, I checked their documentation again - but couldn’t find any updated recommendations. As there is no apparent customer forum, I filed a support ticket asking about this. Unfortunately, the response was disappointing.

More databases - MemSQL and RocksDB

New databases (or at least database-like storage approaches) seem to have been springing up like weeds in recent years. I recently encountered two I wasn’t really familiar with, and so did a little reading. The following two articles provide a brief intro to each:

  • MemSQL - a proprietary distributed relational-like database which keeps row-oriented tables completely in memory but supports column-oriented tables on disk
  • RocksDB - an open-source high-performance key-value store, often used as the backing store for more complex projects (eg Kafka Streams, Samza, MySQL); a small sketch of its API follows below
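
As a taste of how minimal the RocksDB API is, here is a small sketch using its Java binding (org.rocksdb). The database path and keys are just invented for illustration:

    // Minimal RocksDB usage via the Java binding: open a local store, write and read one key.
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RocksSketch {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/rocks-demo")) {
                // Keys and values are plain byte arrays; RocksDB itself imposes no schema.
                db.put("user:42".getBytes(), "alice".getBytes());
                byte[] value = db.get("user:42".getBytes());
                System.out.println(new String(value)); // prints "alice"
            }
        }
    }

Everything is byte arrays stored locally within one process; replication, serialization and query logic are left to the embedding application - which is exactly why it works well as a local storage layer for the kinds of projects listed above.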

Yubikey Concepts, Configuration and Use

Logging in to internet sites (and private servers) with just a password is really not acceptable these days, at least for someone (like me) claiming to be interested in IT security. I therefore recently bought a Yubikey-4 authentication token.

Sadly, the documentation available from the manufacturer, and the internet in general, was not very helpful. I have therefore created some extensive notes on the Yubikey-4 which may be useful if you are also considering buying one (or have already done so).

UPDATE: The Yubikey-5 is available (since late September 2018). Note that the above article also covers Yubikey-5 features.

On a similar topic, I have very brief notes on the pass commandline password-manager for Linux and totp commandline tools for Linux. All feedback is very welcome!

The Snowflake Data Warehouse

Storage Space Efficiency in Avro and HBase

I recently had a customer who suggested (for various reasons) storing large amounts of write-once data in HBase, using an (implicit) schema with long and complicated column names. Given the data volumes involved, I had immediate concerns about disk-storage efficiency with this approach. Various sites warn about long column names in HBase, but I could not find any actual statistics. A colleague and I therefore measured the storage efficiency of HBase with various column-name lengths, and compared it to Avro.
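
To make the concern concrete: HBase stores the full column qualifier inside every cell, so a long column name is paid for once per cell rather than once per table. A small hypothetical sketch (the table and qualifier names are invented for illustration):

    // Hypothetical illustration: HBase repeats the column qualifier in every stored cell.
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class QualifierLengthDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection();
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                byte[] family = Bytes.toBytes("d");
                // This 45-byte qualifier is written again into every single cell that uses it...
                byte[] verbose = Bytes.toBytes("customer_order_line_item_unit_price_in_cents");
                // ...while this one costs only 2 bytes per cell for the same information.
                byte[] terse = Bytes.toBytes("up");
                Put put = new Put(Bytes.toBytes("row-0001"));
                put.addColumn(family, verbose, Bytes.toBytes(1299L));
                put.addColumn(family, terse, Bytes.toBytes(1299L));
                table.put(put);
            }
        }
    }

With billions of cells, the difference between a 45-byte qualifier and a 2-byte one can easily rival the size of the values themselves (at least before compression) - hence the measurements in the article.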

Threat Modelling with STRIDE

A Lambda Architecture with Spark Streaming from Walmart Labs

Walmart Labs have posted an interesting article about analysing a clickstream with a lambda architecture using Spark-streaming and Spark batch.

I was part of a project that tried to do streaming processing with Spark a year or so ago. That didn’t go at all well; we had few resources and little time, and (IMO) Spark-streaming was simply not mature enough for production.

One of the nasty problems we had was that landing data into Hive created large numbers of small files; Walmart Labs solve that by using KairosDB as the target storage instead. KairosDB is a time-series layer on top of Cassandra, ie HBase-like storage.

Another serious problem with Spark-streaming is session detection; it is possible, but only with significant complexity. If I understand correctly, they solve that via the lambda architecture - rough session detection in streaming, and better detection in the batch pass.

They still apparently had to fiddle with lots of Spark-streaming parameters though (batch duration, memory.fraction, locality.wait, executor/core ratios), and write custom monitoring code. And they were running on a dedicated Spark cluster, not YARN. My conclusion from this is: yes, Spark-streaming can work for production use-cases, but it is hard.
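
For anyone curious what that kind of tuning looks like, here is a hedged sketch of the knobs mentioned above (the values are purely illustrative, not Walmart’s; the placeholder socket source just gives the context something to run):

    // Sketch of typical Spark-streaming tuning knobs; all values are illustrative only.
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf()
                    .setAppName("clickstream-sketch")
                    .setMaster("local[2]") // local run for illustration; a real job runs on a cluster
                    // Fraction of the heap shared by execution and storage.
                    .set("spark.memory.fraction", "0.5")
                    // How long to wait for a data-local slot before scheduling elsewhere.
                    .set("spark.locality.wait", "1s");
            // The batch duration is the other big knob: each micro-batch must finish
            // processing before the next one arrives, or the job falls behind.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.socketTextStream("localhost", 9999).print();
            jssc.start();
            jssc.awaitTermination();
        }
    }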

After my experiences, and some confirmation from this article, a solution based on Flink, Kafka Streams, or maybe Apache Beam seems simpler to me. Those are all robust enough to process data fully in streaming mode, ie the kappa architecture.
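
Part of the reason those frameworks feel simpler is that session detection is a built-in windowing concept there. A minimal Flink sketch of the idea (the event type, the 30-minute gap and the placeholder source/aggregation are my own invention, not from the Walmart article):

    // Sketch: Flink session windows group clicks per user, closing a session after 30 idle minutes.
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SessionSketch {
        // Minimal click event: a user id plus an event timestamp in milliseconds.
        public static class Click {
            public String userId = "user-1";
            public long timestampMillis = 0L;
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Click> clicks = env
                    .fromElements(new Click()) // placeholder source; a real job would read Kafka
                    .assignTimestampsAndWatermarks(
                            WatermarkStrategy.<Click>forMonotonousTimestamps()
                                    .withTimestampAssigner((click, ts) -> click.timestampMillis));
            clicks.keyBy(click -> click.userId)
                  // A "session" is all clicks from one user with no gap longer than 30 minutes.
                  .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                  .reduce((a, b) -> b) // placeholder aggregation: keep the last click per session
                  .print();
            env.execute("session-sketch");
        }
    }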

While talking about Spark, here is an unrelated but interesting article on Spark for Data Science: the Good, Bad and Ugly.

The Graal Virtual Machine

Oracle are well known for the Java Virtual Machine project (inherited from Sun). They have now released version 1.0 of a general-purpose virtual machine called Graal that supports:

  • Java bytecode (production) - includes Java, Scala, Groovy, Kotlin
  • Javascript (production) - including Node.js applications
  • LLVM bitcode, ie apps compiled from C, C++, Rust and other languages via the LLVM compiler (experimental)
  • Python, Ruby, and R (experimental)

Code in these languages can call into other code running within Graal, regardless of the language it was written in! Arranging for additional libraries (including the language standard libraries) to be available requires some steps, but is possible.
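
As a small taste of the polyglot idea, here is Java code evaluating a JavaScript function through the GraalVM polyglot API (org.graalvm.polyglot); the snippet itself is just my own trivial example:

    // Java calling into JavaScript running in the same Graal VM.
    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.Value;

    public class PolyglotSketch {
        public static void main(String[] args) {
            try (Context context = Context.create("js")) {
                // Evaluate a JavaScript function and then call it from Java.
                Value doubler = context.eval("js", "(n) => n * 2");
                System.out.println(doubler.execute(21).asInt()); // prints 42
            }
        }
    }

The same Context.eval mechanism is used for the other supported languages (some of which need to be added to the Graal installation separately).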

Not only does this allow running apps in a “standalone” environment, it means that any larger software package which embeds the Graal VM and allows user code to run in that VM can support any language that Graal supports. Examples include database servers which embed the VM for stored procedure logic.

With Oracle, it is important to look at the licensing terms-and-conditions. This does initially seem to be OK; the code is completely licensed under the GPL2-with-classpath-exception, like OpenJDK. Oracle does warn that there is “no support” for the open-source code (aka “community edition”) and recommends that a support licence be bought for the “enterprise edition” instead - but OpenJDK is reliable enough, and so the Graal “community edition” will hopefully be so too.

The Graal project website has more information.