Hive, JDBC and Array-typed Columns

Categories: BigData

Introduction

Relational databases traditionally support only primitive types for columns (int, string, date, etc). The SQL:1999 standard does define an array type, i.e. it allows a single column in a relational table to hold a sequence of values; class java.sql.Types defines an ARRAY type, and class java.sql.ResultSet offers a method getArray(int columnNum). However, in practice very few relational databases have ever implemented arrays (PostgreSQL being a notable exception), as an array column is really a kind of “denormalization”: the same relation can be achieved by storing the values in a separate table with an appropriate common key. JDBC drivers for such databases usually just throw an UnsupportedOperationException if ResultSet.getArray is invoked.
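
For comparison, here is roughly how JDBC array support looks against a database which does implement it, such as PostgreSQL (a minimal sketch; the connection URL, table and column names are invented for illustration):

import java.sql.Array;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcArrayDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical table: create table t (id int, tags text[])
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/somedb");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select id, tags from t")) {
            while (rs.next()) {
                // getArray returns a java.sql.Array wrapper; calling getArray()
                // on the wrapper yields the values as a plain Java array
                Array sqlArray = rs.getArray("tags");
                String[] tags = (String[]) sqlArray.getArray();
                System.out.println(rs.getInt("id") + " -> " + String.join(",", tags));
            }
        }
    }
}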

Although relational DBs don’t generally provide arrays, “big data” databases often support columns which hold arrays. In particular the Hive database supports array-typed columns, and I wanted to find out if such columns could be accessed via the JDBC driver for Hive.

Hive is available as:

  • a library which can be embedded into a Java application; the library offers an API for SQL-like operations, which it implements by launching tasks in an external cluster (using MapReduce, Tez or Spark as the execution engine)
  • a standalone server (or cluster of servers) called hiveserver2, which can be sent SQL-like requests over a network connection using its internal (Thrift-based) network protocol.

Here we are talking about hiveserver2 and the associated JDBC driver which communicates with it using the server’s internal network protocol.

Connecting to Hive with JDBC

The specific hiveserver2 instance I was connecting to was secured with both SSL and Kerberos, and running in “high availability” mode where the addresses of the individual servers are held in ZooKeeper. The connection URL passed to the JDBC driver therefore looked like:

jdbc:hive2://zkhost:zkport/;ssl=true;transportMode=http;serviceDiscoveryMode=zooKeeper;principal=hive/_HOST@PRODUCTION.LOCAL

Note that the principal field identifies the Kerberos principal of the service being connected to. The local Kerberos TGT (ticket-granting ticket) is then used to obtain a service ticket for the specified service principal. The local user (as specified in the current Kerberos ticket or the JAAS config file) must have rights to access that service.
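
As a minimal sketch, opening such a connection from Java looks something like the following (assuming a valid Kerberos ticket already exists in the local ticket cache, and that the Hive JDBC driver jar is on the classpath; the query is just a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveConnectDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://zkhost:zkport/;ssl=true;transportMode=http;"
                + "serviceDiscoveryMode=zooKeeper;principal=hive/_HOST@PRODUCTION.LOCAL";
        // no user/password parameters: authentication is performed via Kerberos
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}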

I did initially encounter some problems authenticating with Kerberos; the solution can be found here.

The Hive version I needed to connect to was v1.2.1 (bundled with the Hortonworks Data Platform v2.5.3), which is now somewhat old. Drivers for Hive versions earlier than v2.0 are unfortunately not in the central Maven repository; they are available:

  • from a Hortonworks Maven repository - but unfortunately only over http, which is not desirable for security reasons
  • from a Spring repository (https://repo.spring.io/plugins-release/)

Using method getArray

Testing showed that:

  • Hive JDBC driver version 1.2.1000.2.4.2.10-1 does not support the optional JDBC method getArray; calling it throws an UnsupportedOperationException. Calling getObject on an array-typed column instead returns a single string (and yes, it really is a plain string) which holds the joined elements in the form “[a,b,c]”; see the sketch after this list, and the section on JSON below, for more information.

  • Hive JDBC driver version 2.0.1 also works with Hive 1.2.1 (yay), but still does not support getArray (boo).

  • Hive JDBC driver version 2.1.0 does not work with Hive 1.2.1 (“required field client_protocol is unset”).
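
To illustrate the first point, here is a sketch of the behaviour (using the tmp table defined in the last section of this article, and a Connection obtained as shown earlier):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ArrayColumnBehaviour {
    static void dumpArrayColumn(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select f2 from tmp")) {
            while (rs.next()) {
                // rs.getArray(1) throws UnsupportedOperationException with this driver;
                // getObject instead returns a plain String, e.g. ["s1","s2","s3"]
                Object value = rs.getObject(1);
                System.out.println(value.getClass().getName() + ": " + value);
            }
        }
    }
}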

Using JSON

Although using method getArray was a failure, I did eventually notice that the Hive wiki states that the string returned for array fields is actually JSON. So array fields can be read after all: fetch them as strings, then pass each string to a JSON deserializer. That is actually a reasonably elegant solution, as it also works for map and struct fields, for which the JDBC API offers no corresponding methods at all.
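
Here is a minimal sketch of that approach, using the Jackson library for the deserialization (any JSON parser would do; the helper method name is my own invention):

import java.sql.ResultSet;
import java.util.List;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HiveArrays {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // reads an array<string> column by treating the returned value as JSON
    static List<String> readStringArray(ResultSet rs, int columnNum) throws Exception {
        String json = rs.getString(columnNum);
        return MAPPER.readValue(json, new TypeReference<List<String>>() {});
    }
}

A map field can be handled the same way by deserializing into a java.util.Map, and a struct field by deserializing into a matching POJO.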

Beeline

Just for reference, here is the URL for connecting to the same hiveserver2 instance via the beeline Hive interactive console (which uses JDBC internally):

beeline -u "jdbc:hive2://zkhost:zkport/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;\
  ssl=true;transportMode=http;httpPath=cliservice;principal=hive/_HOST@PRODUCTION.LOCAL?tez.queue.name=..."

Creating Test Data

The insert into .. values (...) statement in Hive currently does not support UDFs in the values list, including the built-in functions that construct array and map objects. Inserting values into such fields interactively from a Hive command-line tool is therefore tricky. The workaround is to insert via a select over an existing single-row table:

-- do once
create table dummy (s string);
insert into table dummy values ('one');

-- now use repeatedly
create table tmp (f1 string, f2 array<string>, f3 int);
insert into tmp select 'blah', array('s1', 's2', 's3'), 12 from dummy;