I stumbled across the (proprietary) MemSQL database while reading the Data Engineering Weekly News, due to this article on Pandora building dashboards with MemSQL. I found the description interesting, and did a small amount of additional research; this article is simply the resulting notes.
MemSQL is basically a relational database, supporting the usual concepts of tables, rows and columns. It is distributed, ie can be scaled out horizontally to many nodes.
While the “mem” part of the name suggests this is a purely in-memory database, that isn’t entirely true. A table can be declared as “row-oriented” or “column-oriented”; row-oriented tables are persisted to disk, but must be completely loaded into memory in order to be queryable - ie the cluster must have sufficient RAM to hold all (active) row-oriented data. Column-oriented tables are persisted to disk, and do not need to be comletely read into memory. A MemSQL table can also be configured to be in-memory-only, ie not persisted.
Row-oriented tables can be used effectively for “operational” tasks, ie heavy read/write loads which affect just a few rows at a time. Examples are call-center systems, shopping carts.
Column-oriented tables can be used effectively for “analytic” tasks over very large datasets, ie queries which scan large numbers of rows. Examples are reporting and compliance tasks.
Databases which use column-oriented storage are often not particularly good at streaming ingest or updates. MemSQL solves this by combining the two modes; changes to a column-oriented table are buffered in-memory using the row-oriented approach and only occasionally flushed to disk. Row-oriented tables also perform well for analytics tasks, but are limited to datasets that fit into memory.
As with most databases, write-ahead-logs are used for error recovery.
As with most distributed databases, there is no support for foreign-key constraints, or uniqueness constraints except on the primary key.
- almost complete standard SQL support, including grouping/aggregation/windowing functions
- clusters in impala-style: aggregator and leaf nodes
- data sharded using hash(column+) where the column is specified at table definition (default: primary key)
- data replicated to ONE other node IFF system set to “high availability”; sync via “log shipping”
- supports “reference tables” which are replicated to every node
- indexes are supported on rowstore tables (primary and multiple secondary indexes)
- indexes are not supported on columnstore tables
- native Kafka support, ie can ingest data directly from a Kafka topic
- can be accessed using MySQL client libraries (eg JDBC driver)
- works well with Spark
- requires Linux as host OS
In short, it looks rather like a combination of Cassandra and Kudu - or like Impala + Kudu.
It would be interesting to compare performance of MemSQL with Kudu.
The MySQL driver compatibility looks useful - that opens the door to using a lot of standard reporting software.