Typesafe Config

A quick pointer to a Java library that is already reasonably well known: the Typesafe Config library.

This provides an API for loading configuration data from external files. Among other things, it allows properties files to:

  • include references to variables (whose values can be defined as system properties or environment variables, in code, or in the config files)
  • include the contents of other files
  • define times with syntax such as “10 seconds” and memory-sizes such as “512k”

More interestingly, it supports a superset of JSON called HOCON, which allows comments and removes the verbosity and unforgiving punctuation requirements of JSON while retaining its powerful nested structure.
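As a rough illustration of the features listed above, here is a minimal sketch that parses a small HOCON document from a string. It assumes the Typesafe Config jar (com.typesafe:config) is on the classpath; the class name HoconSketch and all keys under "app" are invented for this example.

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.time.Duration;

public class HoconSketch {
    // Hypothetical HOCON document; every key here is made up for illustration.
    static final String HOCON = String.join("\n",
        "# HOCON allows comments, unquoted keys and nested blocks",
        "app {",
        "  name = demo",
        "  greeting = \"hello from \"${app.name}",  // variable reference
        "  timeout = 10 seconds",                   // duration syntax
        "  cache-size = 512k",                      // memory-size syntax
        "}");

    // Parse the text, then resolve the ${...} references.
    static Config load() {
        return ConfigFactory.parseString(HOCON).resolve();
    }

    public static void main(String[] args) {
        Config app = load().getConfig("app");
        String greeting = app.getString("greeting");
        Duration timeout = app.getDuration("timeout");
        long cacheBytes = app.getBytes("cache-size");
        System.out.println(greeting + ", timeout=" + timeout + ", cache=" + cacheBytes + " bytes");
    }
}
```

In a real application the text would normally come from a file on the classpath (for example via ConfigFactory.load()) rather than from a string literal; the string is used here only to keep the sketch self-contained.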

Accessing Hive via JDBC

Hive container is running beyond physical memory limits

I use the Hive command-line tool to run queries against Hive tables. Recently, a query failed with the error message “container is running beyond physical memory limits”.

It took me quite a while to figure out what was happening, and how to work around it. My notes can be found here.

It’s a shame that Tez/Hive don’t handle this automatically. Relational databases never report “out of memory” when running a query just because the source table is particularly large. On the other hand, this table was so large that no relational database could ever have held it…

UPDATE: Shortly after solving the above problem, I struck another out-of-memory problem in Hive which is discussed here. Fun, fun, fun…

Spark Overview

I recently made a presentation on Spark to a group of work colleagues - Data Scientists, Data Engineers, and Operations. Here are the notes I prepared for the presentation.

Scala Overview

As is obvious from this blog, I am a software developer who mostly implements software in Java (though I have used several other languages in the past). However, over the last couple of months I’ve been using the Scala programming language for the first time, and I’ve made some notes on what I’ve learnt about it from a Java developer’s point of view. For those who might be interested, here are some notes on Scala for Java developers, and here are some more advanced notes on Scala’s pattern-matching.

Kafka Connect JDBC Source Where Clauses

The Kafka Connect JdbcSourceConnector reads from a relational database and outputs each row as a message in a Kafka topic.

The config-file supports specifying the data to read as either a table-name (table.whitelist) or a custom query (query). Unfortunately, the documentation states clearly that when the query option is used and “incremental load” is also enabled, the query must not include any “where” component: the connector adds a where-clause of its own, and the combined statement is then invalid SQL.

Is there anything that can be done about this? For most databases, the answer is yes…
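One commonly used workaround (not necessarily the one the linked post describes) is to wrap the filtered query in a subquery, so that the connector's appended where-clause applies to the outer select. The sketch below only models the string handling; appendIncrementalFilter is a simplified stand-in for what the connector does internally, and the table, column and alias names are hypothetical.

```java
public class SubqueryWrapSketch {
    // Simplified stand-in for the connector appending its own incremental
    // filter to whatever text was configured in the "query" option.
    static String appendIncrementalFilter(String query, String column) {
        return query + " WHERE " + column + " > ? ORDER BY " + column + " ASC";
    }

    // Wrap the user's query as a derived table so its own WHERE stays inside
    // the parentheses. The alias is written without AS, since some databases
    // do not accept AS in front of a table alias.
    static String wrapAsSubquery(String query, String alias) {
        return "SELECT * FROM (" + query + ") " + alias;
    }

    public static void main(String[] args) {
        String filtered = "SELECT id, name FROM users WHERE active = 1";

        // Two WHERE keywords in a row: invalid SQL.
        System.out.println(appendIncrementalFilter(filtered, "id"));

        // One inner WHERE, one outer WHERE: valid on databases that
        // support derived tables.
        System.out.println(appendIncrementalFilter(wrapAsSubquery(filtered, "t"), "id"));
    }
}
```

The wrapped form relies on the database supporting subqueries in the from-clause, and the incrementing or timestamp column has to be included in the inner select list so that the outer filter can reference it.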

A Git Repo Mirroring Multiple Remotes

I recently had to create a single Git repo (company internal) holding a mirror of several other projects (from GitHub). While Git can do this, its default behaviour is to mix all tag names into the same namespace, leading to rather confusing results. My solution is documented here.

Java 9's Jigsaw Module Framework (JPMS)

One of the big features for the upcoming Java version 9 is supposed to be the Jigsaw (aka Java Platform Module System or JPMS) module framework. However, it has been controversial over its whole development cycle - and now that the release is coming up, some non-Oracle groups on the advisory board are intending to vote against the release of Java 9 due to flaws in Jigsaw.