Apache Hive Overview

Categories: BigData

Introduction

The big data storage article on this site briefly describes Apache Hive in the context of many other storage technologies. This article addresses Hive specifically.

What is Hive

Hive is primarily a “query language compiler” which converts HiveQL statements (similar to SQL) into programs which can then be executed on a YARN cluster (via MapReduce, Tez or Spark). The end result is the ability to execute SQL-like queries in parallel against large datasets stored in a range of different formats.

Hive allows data stored in various file-formats to be queried as if it were in a traditional database table. Any format can be supported as long as the necessary driver for that format exists; standard file-formats include plain comma-separated-values, the slightly more sophisticated SequenceFile format, or advanced file formats such as AVRO, Parquet and ORCFile. There are also adapters for some datastores which work by sending network requests to the relevant datastore servers.

Hive also supports insert and update operations as long as the driver for the relevant format supports those operations. Hive isn’t suitable for many small updates (eg OLTP-style workloads); the insert/update functionality is primarily used to “derive” tables from others (similar to “materialized views”) for use during analysis, or to batch-import data from other systems.

To provide an SQL-compatible representation of its datasource, Hive requires the format (column names and types, ie a schema) of each input “table” to be declared. These schemas are independent of (separate from) the actual datasource, and are stored in a hive metastore. The separation of schema and data does of course expose the danger that the two do not match. Some file-formats (eg AVRO) are fully self-describing, allowing the declared schema (which was used to compile the query) to be validated against the actual input data. The HCatalog component of Hive provides access to the Hive metastore (ie the set of declared schemas) from other tools.

It could be said that Hive is a “deconstructed” database that separates the query-engine from the storage layout.

Hive can be very effective at OLAP-style workloads, in which long-running queries are performed against very large sets of data. Hive’s ability to analyse data directly from files without a performance-intensive “import” step is also very useful. Due to the latency of compiling queries and executing them on a YARN cluster, it is not suitable for “quick interactive queries” (though the LLAP project is working on that), nor is it suitable for OLTP-style workloads (lots of queries each retrieving a few records).

Logfile analysis provides a good example of Hive’s ability to apply an external schema to an existing file and then perform queries against the original (unmodified) data. Given a typical application logfile in csv format or similar, and a corresponding schema definition, HiveQL (SQL-like) queries can be used to extract data from these logfiles far more elegantly than with grep/awk. And given huge logfiles stored in HDFS, such analyses can be executed in parallel across a cluster, something that is rather hard to do with grep.
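
As a rough sketch of the logfile case (the table name, columns and HDFS path are invented for illustration):

  -- declare a schema over logfiles that already exist in HDFS
  CREATE EXTERNAL TABLE access_log (
    ts STRING,
    client_ip STRING,
    url STRING,
    status INT,
    bytes_sent BIGINT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/logs/webserver/';

  -- the grep/awk-style question, expressed as a query
  SELECT url, count(*) AS errors
  FROM access_log
  WHERE status >= 500
  GROUP BY url
  ORDER BY errors DESC
  LIMIT 20;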

With some storage formats (eg ORCFile), Hive also supports SQL insert/update/delete operations. However Hive isn’t really designed for heavy update loads; it is more focused on read/analysis usecases.

To quote from the hive wiki page:

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.

Supported Data Sources

Having “support” for a file format in Hive actually means having support for that format from an HMapReduce/Tez/Spark java program. The Hive project defines the interface org.apache.hadoop.hive.serde2.SerDe and some related classes; the HMapReduce/Tez/Spark code that Hive generates reads and writes data via this interface. An implementation of this interface for a specific file format is called a SerDe; SerDes exist for many file formats, the following being the most commonly used with Hive:

  • TEXTFILE (sequence of linefeed-terminated lines, each line being a record in csv, tab-separated, or similar format)
  • SEQUENCEFILE (Hadoop’s binary key/value container format)
  • ORC - Optimised Row Columnar format for storing relational tables
  • RCFILE - predecessor of ORC

Note that the Hive SerDe implementations for various formats can also be used by hand-written code if desired.
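
As a sketch of how a SerDe is selected, a table declaration can either name a well-known format via STORED AS or name a SerDe class explicitly. The table names and path below are invented; OpenCSVSerde ships with recent Hive releases and treats every column as a string:

  -- shorthand: let Hive pick the SerDe for a standard format
  CREATE TABLE events_orc (id BIGINT, payload STRING)
  STORED AS ORC;

  -- explicit: name the SerDe class directly
  CREATE EXTERNAL TABLE events_csv (id STRING, payload STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS TEXTFILE
  LOCATION '/data/events/csv/';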

Hive can also generate code to access data via the more general StorageHandler interface, which allows access to data from non-file sources, particularly:

  • HBase - the HBase Storage Handler makes GET/SCAN calls to HBase servers over a network, and exposes data location information so Hive’s generated HMapReduce/Tez/Spark application can efficiently place its tasks near the data to be processed.
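
As an illustration of the StorageHandler approach, a Hive table can be mapped onto an existing HBase table roughly as follows; the Hive and HBase table/column names are invented, and the hive-hbase-handler jar plus HBase client configuration must be available:

  -- map an existing HBase table "users" into Hive
  CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, email STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
  TBLPROPERTIES ("hbase.table.name" = "users");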

Access from Hive to data in Cassandra currently appears to be less well supported than for HBase. The Cassandra project does provide code for accessing Cassandra from HMapReduce programs, but this code does not implement the Hive SerDe interface, and therefore cannot be used with Hive. There are several Hive/Cassandra projects on GitHub, including Edward Capriolo’s hive-cassandra-ng. The Cassandra project (and the DataStax company behind much Cassandra development) has switched back and forth between supporting Hive and HMapReduce and removing such support several times over the years. Currently, it seems that their policy is to support only Spark, although this datastax blog entry is interesting.

Embedded vs Server Mode

Hive can be used as a library from a standard java application. In this mode, HiveQL statements are compiled within the client application, producing an HMapReduce/Tez/Spark-based “master application” which is then submitted to the cluster (eg YARN) for execution.

The Hive project also provides a stand-alone server (“hiveserver2”). HiveQL statements can be sent to this server via REST or Thrift; a JDBC driver is provided which uses Thrift to forward statements. Multiple instances of this server can be run if desired; they don’t “manage” the underlying data in any real sense, so don’t need to cluster or communicate between themselves in any way.

In both cases, the metadata that specifies table formats and table-partition-files needs to be stored in a central location. The Hive metastore provides this functionality, optionally fronted by an HCatalog server (see later).

With the “embedded” approach, the client application needs direct access to the YARN resource manager in order to submit the application, and access to the HCatalog instance (or metastore database directly). Executing queries via hiveserver2 makes network access configuration simpler: the client application needs only access to a hiveserver2 instance. Executing a query via hiveserver2 does require an extra network “hop”, but as Hive queries are not “low latency”, this is not usually a problem.

Loading Data into Hive

Data can simply be written into a distributed filesystem using some format that Hive supports (and there are many) - including simple formats such as a sequence of lines with comma-separated-values. HiveQL commands can then be used to tell Hive how to interpret the data in the file as rows/columns (“create table”); this schema information is stored in a central metadata repository. Finally, Hive commands can be used to tell Hive which files belong to the table (“add partition”). Queries can now be executed against the data “in place”, ie with no further “load” step. For tasks such as analysing webserver logfiles, this can be a major simplification over somehow importing the data (in relational form) into a database in order to perform analysis.
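
A minimal sketch of this flow (the table name, columns and paths are invented for illustration, and the same hypothetical table is reused in later sketches):

  -- declare the schema for files that already exist in HDFS
  CREATE EXTERNAL TABLE clicks (user_id STRING, url STRING, ts STRING)
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

  -- register a directory of existing files as one partition of the table
  ALTER TABLE clicks ADD PARTITION (dt='2016-03-01')
  LOCATION '/data/clicks/2016-03-01/';

  -- query the data in place; no load step required
  SELECT count(*) FROM clicks WHERE dt='2016-03-01';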

Alternatively, data can be inserted into tables managed by Hive - as long as the chosen format is one that supports inserts. Declare the table in Hive (“create table”), then perform insert-statements which Hive passes along to the format “adapter” class, which writes the data to the filesystem. This approach is similar to loading data into a relational system.
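
And a sketch of the managed-table variant, deriving a summary table from the hypothetical raw table declared above:

  -- a Hive-managed table in a format that supports inserts
  CREATE TABLE clicks_summary (url STRING, hits BIGINT)
  STORED AS ORC;

  -- derive it from the raw table (batch insert, similar to a materialized view)
  INSERT INTO TABLE clicks_summary
  SELECT url, count(*) FROM clicks GROUP BY url;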

Multiple Back-ends

The standard Hive distribution comes with support for compiling to HMapReduce, Tez or Spark. The hive configuration files specify which back-end is the default, but any client application can choose the back-end for its current session by setting the hive.execution.engine property, eg “set hive.execution.engine=spark;”.

Of course, the hive configuration must also provide the necessary contact details for the target environment, eg spark-specific settings must be defined in order for the spark back-end to work correctly.
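
A hedged sketch of the per-session settings involved (the spark values below are illustrative and depend on the local installation):

  -- per-session engine selection (valid values: mr, tez, spark)
  SET hive.execution.engine=spark;

  -- spark contact details must also be available, either set here or in the
  -- hive/spark configuration files
  SET spark.master=yarn;
  SET spark.executor.memory=2g;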

The Hive Metastore and HCatalog

In order to properly apply SQL statements to data in formats such as a textfile holding lines consisting of comma-separated-values, Hive needs to know which column names and types to associate with data in input records. This information is provided via a HiveQL “create table” statement, and this information is stored in a metastore.

The term “metastore” really refers to a software module within Hive, which other Hive code uses to write and read such metadata. Somewhat ironically, the metastore code currently persists metadata in a relational database (eg Postgres or MySQL), although Hive 2.0 (released 2016) includes support for storing it within an HBase database instead.

Because the metadata is stored in an external database, multiple Hive instances (whether “embedded” or HiveServer processes) can share a common metastore database. However there are disadvantages to having multiple Hive instances simply accessing metadata by directly reading/writing a database, including:

  • no shared caching between metadata users
  • direct network access to the database instance needed
  • database login credentials needed
  • potential issues when different clients have different release versions

A more elegant solution to managing metadata is therefore to run an HCatalog server which acts as a “front end” to the metadata database. The “metastore module” within Hive applications then communicates with the HCatalog server using a purpose-designed network protocol rather than performing direct SQL against the underlying database. HCatalog offers a Thrift-based interface (performant) and a REST-based interface (flexible) for querying and updating table-related metadata. This approach also allows non-hive applications to query the same metadata; for example, Apache Pig scripts can reuse the schema declarations defined via Hive.

HCatalog server instances are still just “front ends” for the underlying database (whether relational or HBase), and therefore it is possible to run multiple HCatalog instances simultaneously if desired; they do not need to share any state (other than that in the common database) and thus need no special clustering support.

The ORCFile Format

The most sophisticated and flexible of all the currently supported Hive file-based storage formats is ORCFile. This uses a column-store layout for efficient querying.

ORCFile supports inserts and updates by writing new delta files which are later merged, similar to the SSTable/log-structured-merge-tree approach. This is efficient for batch-mode inserts and updates; records should not be inserted one-by-one, as that leads to very large numbers of store-files (there is no long-lived “mem-store” as with HBase or Cassandra).
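
A sketch of a transactional ORC table (names invented; the Hive ACID configuration settings must be enabled, and older Hive versions also require the table to be bucketed):

  CREATE TABLE users_txn (id BIGINT, name STRING, email STRING)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  -- updates are written as delta files and merged (compacted) later
  UPDATE users_txn SET email = NULL WHERE id = 42;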

Custom Functions

Hive supports “UDFs”, User Defined Functions. These allow a developer to write some java code and then call it from within a HiveQL query. Hive takes care of compiling the query to efficient code and distributing the work over a cluster; the custom java code is then invoked where appropriate. Common uses for UDFs are custom comparisons and field formatting.
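
As a sketch, the HiveQL side of using a UDF looks roughly as follows; the jar path and java class name are hypothetical (the class would implement Hive’s UDF/GenericUDF API), and the query reuses the hypothetical clicks table from earlier examples:

  -- register the custom code for this session and call it from a query
  ADD JAR /opt/udfs/my-udfs.jar;
  CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrl';

  SELECT normalize_url(url), count(*)
  FROM clicks
  GROUP BY normalize_url(url);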

Fault-Tolerance and Scalability

Because Hive uses HMapReduce/Tez/Spark to execute queries, it inherits the fault-tolerance of these systems. All these back-ends track the status of the individual tasks that make up the overall “application”, and restart any tasks that fail. This means that a long-running Hive query will not fail just because a single node crashes or a single network connection fails - particularly important when running analysis queries over large data volumes which may take hours to complete.

The use of these technologies also means that data storage capacity and processing capacity are delegated to the underlying distributed filesystem and cluster. HDFS and YARN can build storage and processing clusters consisting of thousands of nodes, as can alternative technologies, and Hive queries can take advantage of that.

Querying Nested Data

Some data storage formats, including AVRO and Parquet, support “nested” data, ie records where some fields are themselves records.

HiveQL (unlike standard SQL, eg that supported by Drill) supports querying nested data. However referencing very deeply nested data is still somewhat awkward - languages such as XQuery are more elegant for this usecase.
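
A small sketch of such queries (the table layout is invented for illustration):

  -- a table with a nested struct and an array column
  CREATE TABLE people (
    name STRING,
    address STRUCT<street:STRING, city:STRING>,
    phones ARRAY<STRING>
  )
  STORED AS AVRO;

  -- dotted access into the struct, index access into the array
  SELECT name, address.city, phones[0] FROM people;

  -- flattening the array with a lateral view
  SELECT name, phone
  FROM people LATERAL VIEW explode(phones) p AS phone;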

Indexes

To quote from the relevant Hive Wiki page:

The goal of Hive indexing is to improve the speed of query lookup on certain columns of a table. Without an index, queries with predicates like ‘WHERE tab1.col1 = 10’ load the entire table or partition and process all the rows. But if an index exists for col1, then only a portion of the file needs to be loaded and processed.

The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index.

Index data is stored “per table partition”. When table-partitioning changes, it is necessary to run “alter index .. rebuild” to update the index too.

Building or rebuilding an index executes a mapreduce job to generate the index-file. This implies that the feature helps only for queries that exclusively use the indexed fields - ie for “covered” queries. It is therefore roughly equivalent to executing a query that generates a completely separate table, and then writing queries against that table.

Indexing is particularly useful when searching for relatively rare records.

For formats like ORCFile, a “compact” index points to the HDFS block holding the data, and to the record-block within that HDFS block, but not to the exact record - a scan is then used to find all matching records in that record-block. This is useful for things like looking up a webserver-log by session-id: within a short time-period (a few adjacent blocks) there may be multiple matching records.

Bitmap indexes work similarly, except that there is also a bitmap with one bit per record in the block (eg blocks of 1024 records have a 1024-bit bitmap), indicating which records match the index. This makes count(…) very efficient. The bitmap can be compressed with run-length-encoding, significantly reducing space when many or few records match the index.
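
A sketch of the index DDL involved (column names refer to the hypothetical clicks table used in earlier examples):

  -- WITH DEFERRED REBUILD means the index is only populated when rebuilt
  CREATE INDEX idx_clicks_url ON TABLE clicks (url)
  AS 'COMPACT' WITH DEFERRED REBUILD;

  CREATE INDEX idx_clicks_user ON TABLE clicks (user_id)
  AS 'BITMAP' WITH DEFERRED REBUILD;

  -- rerun after partitions change, as described above
  ALTER INDEX idx_clicks_url ON clicks REBUILD;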

Access Control

Hive provides two kinds of access-control:

  • SQL-based: defining roles in Hive itself, and associating roles with tables;
  • storage-based: delegating to HDFS

The SQL-based approach is familiar to anyone who has used a relational database. However it is enforced only for operations that are performed through Hive; since Hive is just one tool for querying files in HDFS, these checks are ineffective when other tools (or direct access to the underlying HDFS files) can reach the same data.
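
A sketch of the SQL-based approach (the role and user names are invented, and the exact GRANT syntax accepted depends on which authorization mode hiveserver2 is configured with):

  CREATE ROLE analysts;
  GRANT SELECT ON TABLE clicks TO ROLE analysts;
  GRANT ROLE analysts TO USER alice;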

In the storage-based approach, Hive allows a user to query a table if-and-only-if that user has read access (at the HDFS level) to the datafile that the table refers to. Access-control is less fine-grained than the SQL-based approach, but is often easier to manage and harder to bypass. Combined with tools like Ranger which provide sophisticated access-control for HDFS files, this can be an effective solution.

Bucketing

HiveQL offers a feature named “bucketing”. This is effectively equivalent to “group by hash(column) % N” where N is the “number of buckets”.

Each “group” therefore has a random and roughly-equal subset of the rows, where all records with column.value=Z are guaranteed to be in the same group. The usefulness for joins is obvious.

Additionally, the records within a bucket (group) may be sorted by the column, making joins even more effective (merge-sort).
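
A sketch of a bucketed table, again using the hypothetical clicks table from earlier examples (32 buckets is an arbitrary choice):

  CREATE TABLE clicks_bucketed (user_id STRING, url STRING, ts STRING)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

  -- older Hive versions need this so that inserts honour the bucket layout
  SET hive.enforce.bucketing=true;

  INSERT INTO TABLE clicks_bucketed
  SELECT user_id, url, ts FROM clicks;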

Transform Scripts

Hive generates MapReduce programs for SQL commands. Because Hive knows the location of the datafiles associated with specific “tables”, it can also be convenient to give Hive a script and ask Hive to execute it as the map or reduce phase of a job. The TRANSFORM statement within HiveQL runs a user-provided script as a map or reduce command - without complicated map/reduce boilerplate.

For each input record, the columns are converted to a single text line containing a tab-separated list of strings. The line is then written to the STDIN of the specified transform script. The script can write lines in the same format to STDOUT. The statement can give “names” to the output values via a following “AS col1,col2,..”; if that is not present then the first column is named “key”, and all following text is named “value”.

Map and reduce operations are identical - both simply read a stream of input records and write a stream of output records. The only difference is which records each sees: the “map” stage is given an arbitrary subset of records, while the “reduce” stage is given the full set of records for a specific key partition (as defined by DISTRIBUTE BY) - and, when sorting is requested via SORT BY, those records arrive sorted.
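
A sketch of both styles, using a hypothetical python script and the clicks table from earlier examples (the script reads tab-separated lines on stdin and writes tab-separated lines to stdout):

  ADD FILE /opt/scripts/summarize_user.py;

  -- map-style: each script instance sees an arbitrary stream of records
  SELECT TRANSFORM (user_id, url)
    USING 'python summarize_user.py'
    AS (user_id, summary)
  FROM clicks;

  -- reduce-style: DISTRIBUTE BY routes all records for one user_id to the
  -- same script instance, SORT BY delivers them in order
  FROM (SELECT user_id, url FROM clicks DISTRIBUTE BY user_id SORT BY user_id) t
  SELECT TRANSFORM (t.user_id, t.url)
    USING 'python summarize_user.py'
    AS (user_id, summary);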

Views

Hive supports “views” similarly to relational databases - a kind of “select macro”. Writing a select referencing the “view definition” as the source-table generates an “expanded” select which merges the original view-definition with any other terms added by the caller. Views are read-only.
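
A minimal sketch, again using the hypothetical clicks table:

  CREATE VIEW recent_clicks AS
  SELECT user_id, url FROM clicks WHERE dt >= '2016-03-01';

  -- the view definition is merged into the caller's query at compile time
  SELECT url, count(*) AS hits FROM recent_clicks GROUP BY url;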

Impala

Impala is an attempt to build a “faster Apache Hive”, at least for simpler queries. It is open-source (github.com), but mostly sponsored by Cloudera, which is a 1000+ employee, USA-based company with a co-creator of Hadoop as its Chief Architect and various ex-facebook/yahoo/google/oracle managers. It appears to be reasonably open-source-friendly and earns income via consulting and training.

As described in the Impala overview, Impala is intended to execute queries with less latency than Hive. For current versions of Hive, each query is compiled into a MapReduce, Tez or Spark application, which is then executed on a YARN cluster. For long-running queries (those that process large amounts of data and take minutes or hours to complete), the compilation and scheduling steps are not significant. However there are use-cases where users wish to perform queries against small amounts of data (for prototyping, interactive exploration of datasets, etc) - either via small databases, or by using appropriate limiting operations in the query such as selecting single data partitions.

Impala optimises for the interactive case by starting a number of long-running daemons to execute queries, rather than starting processes on demand. These daemons can also interpret the submitted query reasonably directly, rather than requiring it to be compiled to MapReduce or a similar form, which saves further startup time for each query.

From the documentation, “Impala SQL is a subset of HiveQL”.

From other entries in the FAQ, it appears that Impala is not a 100% replacement for Hive, and is actually complementary. Quick/real-time queries can be done in Impala while trickier cases not supported by Impala must be done via Hive. The FAQ also recommends that “batch processing” be done via Hive.

Drill

Apache Drill is yet another project which competes directly with Apache Hive. Drill supports a superset of standard SQL, and can execute queries against many different datasources.

Hive and Oracle

The Oracle Big Data Connectors for Hadoop provide a few small tools to assist in dealing with Hive-managed data in an Oracle environment:

  • Oracle SQL Connector for HDFS: can define an “external table” in Oracle which maps to data in HDFS.
  • Oracle Loader for Hadoop: fast import of data from hadoop into a normal Oracle table (commandline tool)
  • Oracle Data Integrator Application Adapter for Hadoop: import data from hadoop into oracle (gui tool)
  • Oracle XQuery for Hadoop: compile an xquery-format query into HMapReduce jobs.
  • Oracle R Advanced Analytics for Hadoop: Program in R, compiles to HMapReduce.
