About This Article

Elasticsearch is a kind of document database with extended search features. This site has an overview of Elasticsearch which briefly mentions Elasticsearch aliases and what can be done with them. There are some interesting aspects of aliases which are not well covered in the official Elasticsearch guide; this article expands on that topic.

In particular, this article describes a way to handle mapping updates over time, similar to how tools such as Flyway or Liquibase apply SQL schema changes to relational databases.

Aliases

An alias is a name which refers to one or more indices. The alias-name can be treated like an index-name in many ways:

queries can be executed against the alias, just like against an index
inserts can be executed against an alias - but only when the alias refers to exactly one index.

An alias can be seen in two ways:

as a property of an index (ie an index has zero or more aliases) which gives it an alternate name (which may be shared with other indices); or
as an independent entity (an alias points to one or more indices)

The first definition is more accurate. In particular, when an index is deleted, the alias automatically stops referring to that index, and if no other index has the same alias then the alias no longer exists. When querying ES to determine which “aliases exist”, the results are returned as a “list of indices with that alias”, which is also consistent with the first representation. However, the API for adding/removing aliases does seem to imply the second model.
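
To make the first model concrete, here is a minimal in-memory sketch. It is a toy model and not the Elasticsearch API: the dict layout and the function names are my own assumptions, chosen only to show that an alias stored as a property of an index needs no separate cleanup when that index is deleted.

```python
# Toy in-memory model (not the Elasticsearch API): each index carries its
# own set of aliases, so there is no separate alias registry to maintain.
indices = {
    "index1": {"aliasA"},
    "index2": {"aliasA", "aliasB"},
}


def indices_for_alias(indices, alias):
    """Return the sorted names of all indices carrying the given alias."""
    return sorted(name for name, aliases in indices.items() if alias in aliases)


def alias_exists(indices, alias):
    """An alias exists only while at least one index still carries it."""
    return any(alias in aliases for aliases in indices.values())


def delete_index(indices, name):
    """Deleting an index drops its alias set with it."""
    del indices[name]


delete_index(indices, "index2")
print(indices_for_alias(indices, "aliasA"))  # aliasA now refers to index1 only
print(alias_exists(indices, "aliasB"))       # aliasB vanished with index2
```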

Querying an Alias

Document ids are not unique across indices - or even between doctypes in the same index! Query results are always represented as a “list of hits” where each hit specifies (index, doctype, id) - so duplicated ids are no problem as long as the system is not storing ids as “the key”, ie treating them like relational database keys.

When document ids are being stored externally, care needs to be taken to ensure those ids are appropriate for the query being made. In particular, when using an alias which points to multiple indices, ids may not be unique.
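
As a small illustration, the sketch below keys search hits by the full (index, doctype, id) triple instead of by id alone. The hit fields _index, _type, _id and _source are the ones returned in search results; the sample hits and the helper name hit_key are made up for this example.

```python
def hit_key(hit):
    """Build a key that stays unique across indices and doctypes."""
    return (hit["_index"], hit["_type"], hit["_id"])


# Two made-up hits, as a query against a multi-index alias might return them.
hits = [
    {"_index": "index1", "_type": "mapping1", "_id": "42", "_source": {"name": "a"}},
    {"_index": "index2", "_type": "mapping1", "_id": "42", "_source": {"name": "b"}},
]

# Keyed by id alone, the second hit silently overwrites the first.
by_id = {hit["_id"]: hit["_source"] for hit in hits}

# Keyed by the full triple, both documents survive.
by_key = {hit_key(hit): hit["_source"] for hit in hits}

print(len(by_id), len(by_key))
```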

Inserting into an Alias

If an alias points to more than one index, an error message is returned when performing an insert/update:

Alias [aliasA] has more than one indices associated with it [[index2, index1]], can't execute a single index op

Reindexing

Elasticsearch offers a REST API for “reindexing” an index - effectively a copy of all data from one index to another by reinserting the source of each document. This is very useful when it is necessary to modify the mappings of an existing index in a non-backwards-compatible way. It is also useful when upgrading between versions of Elasticsearch, to ensure all current data is “resaved” in the latest disk format.

Reindexing does preserve the _id property of documents, ie if an application has stored id values elsewhere, they are still valid after reindexing.
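
One way to picture how ids survive a copy: a hand-written reindex typically reads hits from the source index and replays them through the bulk API with the original _id set explicitly. The sketch below only builds the newline-delimited bulk body from a list of hits; it is my own illustration (the function name to_bulk_body is hypothetical), not the code of the reindex module, and it sends nothing to a server.

```python
import json


def to_bulk_body(hits, dest_index):
    """Turn search hits into a bulk request body targeting dest_index.

    Each document becomes two lines: an "index" action carrying the
    original _type and _id, followed by the document source.
    """
    lines = []
    for hit in hits:
        action = {
            "index": {
                "_index": dest_index,
                "_type": hit["_type"],
                "_id": hit["_id"],  # reuse the id so external references stay valid
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


sample_hits = [
    {"_index": "index1", "_type": "mapping1", "_id": "7",
     "_source": {"name": "name2", "intval": 98}},
]
print(to_bulk_body(sample_hits, "index2"))
```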

Interestingly, updating an existing mapping from integer to a string-type is not permitted. However reindexing a document where the origin index defines a field as integer and the target index defines the field as string-typed does work - because by default Elasticsearch “coerces” integers to strings and vice-versa.

Reindexing is currently implemented as a “standard extension module”, rather than being in the ES core code. This unfortunately means that the logic accessible via a REST API is not accessible via the Java transport-client API. As the project I was working on was using the transport-client API for performance reasons, I (re)implemented reindex in Java in about 50 lines of code; here is a Java utils-class with that implementation (see method copyIndex), and here is the corresponding junit test. The test relies on a helpful custom junit rule for interacting with an embedded Elasticsearch instance.

Partitioning Data

A common pattern in many “big data” databases is to have data stored in multiple partitions (files) so that unwanted data can be efficiently removed from the system by just “deregistering” that partition, and then deleting the associated datafiles.

ES aliases can be used in the same way - but with some limitations. Because an alias can point to multiple indices, data for various geographical zones or time-periods can be stored in different indices, and an alias can refer to the full set of indices. A query applied to the alias is then applied to each index. Deleting an entire index is efficient, and the alias automatically no longer references it. As long as one index with that alias remains, applications performing queries will continue working as expected.

However because writes through an alias are only supported for aliases pointing to one index, applications which write data need to address the underlying index directly - or use an alias which points to just “the current” index into which data should be inserted. This asymmetry between reading and writing can be awkward.

More importantly, care needs to be taken when updating a document. A query through the alias returns results from multiple indices; an update must be written back to the same index the document was read from, otherwise a new copy of the document is created instead of the old one being updated, and a later query can return both versions.
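
The following toy simulation shows the failure mode. It uses plain dicts in place of Elasticsearch, and the index names logs-2016 and logs-2017 as well as the helper functions are invented for the example.

```python
# Toy store standing in for two indices behind one alias:
# {index_name: {doc_id: source}}
store = {
    "logs-2016": {"1": {"status": "new"}},
    "logs-2017": {},
}
alias_indices = ["logs-2016", "logs-2017"]  # what the read alias points to
write_index = "logs-2017"                   # the "current" index for inserts


def search_alias(store, index_names):
    """Return hits from every index behind the alias, tagged with their origin."""
    return [
        {"_index": name, "_id": doc_id, "_source": source}
        for name in index_names
        for doc_id, source in store[name].items()
    ]


def save(store, index_name, doc_id, source):
    """Insert or overwrite one document in one concrete index."""
    store[index_name][doc_id] = source


hit = search_alias(store, alias_indices)[0]
updated = dict(hit["_source"], status="done")

# Wrong: writing to the "current" index leaves the old copy in place.
wrong = {name: dict(docs) for name, docs in store.items()}
save(wrong, write_index, hit["_id"], updated)

# Right: write back to the index the hit came from.
right = {name: dict(docs) for name, docs in store.items()}
save(right, hit["_index"], hit["_id"], updated)

print(len(search_alias(wrong, alias_indices)))  # two versions of the document
print(len(search_alias(right, alias_indices)))  # one updated document
```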

In addition, the _id property assigned to a document is only unique within a specific (index, doctype). It is very unlikely that the same id will be auto-allocated for two different documents in different indices, but not impossible. Inserting the same document into two different indices will succeed as two inserts - and a query for that _id will return two hits: (index1, doctype1, id) and (index2, doctype2, id).

Updating Mappings

Over a business application’s lifetime, the format of data it stores evolves. With relational databases there are lots of tools and strategies associated with updating database schemas to match new releases of software. NoSQL databases which are “schemaless” or “schema-on-read” have fewer problems in this area, but Elasticsearch is not of this type - its indices/mappings are strictly-typed like relational systems. Elasticsearch does have the ability to define a mapping based on the first document inserted into the mapping, but the resulting mapping is still strictly typed. An application storing data in Elasticsearch therefore also needs a strategy for evolving mappings together with new releases of the application.

Here is the strategy I use for maintaining indices, mappings and aliases over application version releases. It’s an Elasticsearch equivalent of Flyway/Liquibase/etc for relational databases.

The application includes json files defining indices and mappings as resources in the classpath (in sourcecode, under src/main/resources/...). The application has a command-line option to start in “elasticsearch initialisation mode”, and the sysadmins are expected to use this when installing each new release. Combining the actual application¹ and the database-setup (rather than delivering a separate init-tool or scripts) simplifies things during installation and makes it difficult to use the wrong version for initialisation.

The commandline options may also include a list of “indices to reindex” (which can be “all”).

The list of managed-index-definitions is dynamically determined via classpath inspection; each index is represented by a json file whose name matches the alias through which the code refers to that index. For each such file, the following is done:

If filename is in “indices to be reindexed” (as specified on the commandline):

If there is no alias matching the filename, report an error.
If the alias points to more than one index, report an error.
Parse the name of the index the alias points to; it should be of the form name-v{N}, and N needs to be extracted.
Create a new index named name-v{N+1}, and apply all mappings from the json file to that index.
Copy (reindex) all data from the old index to the new one.
Change the alias to point to the new index.
Delete the old index.
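
The reindex steps above can be sketched as a small planning function. This is an illustration in Python rather than my actual Java code; the names next_index_name and plan_reindex and the tuple-based step format are assumptions made for the example, and the function only produces a plan without talking to Elasticsearch.

```python
import re

# Index names are expected to look like "name-v{N}".
INDEX_NAME = re.compile(r"^(?P<base>.+)-v(?P<version>\d+)$")


def next_index_name(current):
    """Return the name-v{N+1} successor of an index named name-v{N}."""
    match = INDEX_NAME.match(current)
    if match is None:
        raise ValueError(f"index name {current!r} is not of the form name-vN")
    return f"{match.group('base')}-v{int(match.group('version')) + 1}"


def plan_reindex(alias, alias_targets):
    """Return the ordered steps for reindexing the single index behind alias.

    alias_targets is the list of index names the alias currently points to.
    """
    if not alias_targets:
        raise ValueError(f"no alias named {alias!r} exists")
    if len(alias_targets) > 1:
        raise ValueError(f"alias {alias!r} points to more than one index")
    old = alias_targets[0]
    new = next_index_name(old)
    return [
        ("create_index_with_mappings", new),
        ("reindex", old, new),
        ("switch_alias", alias, old, new),
        ("delete_index", old),
    ]


for step in plan_reindex("orders", ["orders-v3"]):
    print(step)
```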

Otherwise install or upgrade-in-place:

If there is no alias matching the filename then create a new index with name {filename}-v1, apply all the mappings from the file, and then define an alias {filename}->{filename}-v1.
Else determine which index the alias points to (error if more than one) then apply (PUT) the mappings contained in the file to that existing index. If Elasticsearch rejects the mapping (due to incompatible changes) then report this as an error, with the recommendation that the upgrade be reapplied with this index specified in the “reindex” list.
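
The install-or-upgrade branch can be sketched in the same spirit. Again this is illustrative Python, not my Java implementation: the three callables stand in for whatever client code talks to Elasticsearch, and MappingRejected is a hypothetical exception that the put_mappings callable is assumed to raise on incompatible changes.

```python
class MappingRejected(Exception):
    """Hypothetical error raised by put_mappings on an incompatible change."""


def install_or_upgrade(filename, alias_targets, create_index, put_mappings, add_alias):
    """Install a fresh index or update the existing one in place.

    alias_targets is the list of index names the alias `filename` points to.
    Returns the name of the index that ends up behind the alias.
    """
    if not alias_targets:
        # First installation: create {filename}-v1 and alias it.
        index = f"{filename}-v1"
        create_index(index)
        put_mappings(index)
        add_alias(filename, index)
        return index
    if len(alias_targets) > 1:
        raise ValueError(f"alias {filename!r} points to more than one index")
    index = alias_targets[0]
    try:
        # Identical or backwards-compatible mappings are accepted.
        put_mappings(index)
    except MappingRejected as exc:
        raise RuntimeError(
            f"mappings for {index!r} are incompatible; "
            f"rerun the upgrade with {filename!r} in the reindex list"
        ) from exc
    return index


calls = []
install_or_upgrade(
    "orders",
    [],
    create_index=lambda index: calls.append(("create_index", index)),
    put_mappings=lambda index: calls.append(("put_mappings", index)),
    add_alias=lambda alias, index: calls.append(("add_alias", alias, index)),
)
print(calls)
```

Because the in-place branch only PUTs mappings, running it a second time with unchanged files performs the same harmless call again, which is what makes the process repeatable.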

The result is that when mappings do not change for an index, then the upgrade-process is a “no-op” - it PUTs the identical mappings onto the existing index, which works fine without side-effects. The process can therefore be applied multiple times without problems. When the mappings do change, but are backwards-compatible with the old mappings then the index is just updated-in-place. When the mappings are incompatible, then reindex must be applied, incrementing the version suffix of the index-name. The alias used by application code always points to the appropriate index.

This also elegantly handles Elasticsearch version upgrades; simply specifying “reindex=all” ensures that all existing data is copied from the old storage format to the new storage format. Elasticsearch guarantees that it can always read data from the previous major release.

This process is not intended to be applied to a running system; in my case a traditional outage-window for upgrades is possible. It may be possible to use the same approach (or a modified version) for online (rolling) upgrades, but I have not spent any time considering the implications.

When designing this strategy, I did initially consider using aliases referencing multiple indices, but that turned out not to be feasible. In particular, writes against such aliases are not allowed by Elasticsearch, and updates (read/write) of existing documents have related problems. However, if you are storing purely time-series data (write-once) then aliases with multiple indices may be worth considering.

Useful REST Queries

Here are some HTTP requests that may be useful when testing alias-related behaviour of Elasticsearch for yourself:

# Show the current fields of mapping 'mapping1' in index 'index1'
GET localhost:9200/index1/mapping1/_mapping?pretty

# List all indices with aliases of 'aliasA'
GET localhost:9200/aliasA/_aliases?pretty

# Add a document of type 'mapping1' to index 'index1'
POST localhost:9200/index1/mapping1
{
  "name":"name2",
  "intval": 98
}

# ??
GET localhost:9200/index2/mapping1?pretty

# List all documents in all indices referenced by 'aliasA'
POST localhost:9200/aliasA/_search?pretty
{
  "query":{
    "match_all":{}
  }
}

# Define a mapping with no fields
PUT localhost:9200/index2/_mapping/mapping1
{
  "properties":{}
}

# Copy all data from 'index1' to 'index2'
# Sadly, this functionality is not available via the Java transport-client API
POST localhost:9200/_reindex
{
  "source": {
    "index": "index1"
  },
  "dest": {
    "index": "index2"
  }
}

# Modify index alias
POST localhost:9200/_aliases
{
    "actions" : [
        { "remove" : { "index" : "index2", "alias" : "aliasA" } }
    ]
}
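
The request above only removes an alias. For the “change the alias to point to the new index” step of the upgrade process, a remove and an add action can be sent in a single _aliases request, which Elasticsearch applies atomically, so the alias never points to zero or two indices in between. A minimal sketch that builds such a request body (the helper name alias_swap_body is my own, and the body still has to be POSTed to /_aliases by whatever HTTP client is in use):

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build an _aliases request body moving alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


print(json.dumps(alias_swap_body("aliasA", "index1", "index2"), indent=2))
```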


Footnotes

¹ Actually, the application uses a microservices architecture, and so is a group of applications. Each application contains the same init functionality and holds the resources which define the indices it is responsible for.
