Diagnosing issues with Elasticsearch

We use Logstash in our infrastructure: it ingests logs and outputs the transformed data to a store of your choice, which for us is Elasticsearch. We then use Grafana to visualise some of this data. Recently we noticed that Grafana was struggling to perform the queries it needed, which led us to investigate and fix some issues with Elasticsearch.

Elasticsearch exposes a well-featured HTTP API for administration, which you can easily access using curl. The following examples all assume that you are running Elasticsearch locally on the default port 9200.

First things first: to get basic information about the Elasticsearch instance, run

curl http://localhost:9200/

which will give you the version for example, so that you can be sure you’re looking at the right version of any documentation. To get the status of the cluster, you can run

curl 'http://localhost:9200/_cluster/health?pretty'

which should give you a response like this:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 95,
  "active_shards" : 95,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.95833333333334
}

Including the pretty parameter makes the response more human-readable. There are various bits of useful information here. Firstly, the status gives you a general idea of how your cluster is doing: red means some indices are not available to query, yellow means some indices are not fully replicated, and green means everything is good. When the cluster first loads, shards are counted under initializing_shards while they spin up, and move into active_shards once they are ready. For us, around a third of the shards would reach the active phase before the service restarted itself.
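If you only need the status field, for a monitoring script say, you can ask Elasticsearch to trim the response with the standard filter_path query parameter, or pull the field out of the full response with ordinary shell tools. A minimal sketch, using a cut-down copy of the sample health response above:

```shell
# Trim the response on the server side (filter_path is a standard query
# parameter on the Elasticsearch APIs):
#   curl 'http://localhost:9200/_cluster/health?filter_path=status&pretty'

# Or extract the field from the full response with sed; here a cut-down
# copy of the sample response stands in for the live curl output:
health='{"cluster_name":"elasticsearch","status":"yellow","timed_out":false}'
status=$(echo "$health" | sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p')
echo "$status"   # yellow
```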

You can find out information about the individual shards by running

curl 'http://localhost:9200/_cat/shards?v'

which should give you an output something like this:

index               shard prirep state           docs   store ip        node
logstash-2018.02.27 0     p      STARTED      6219832   7.3gb 127.0.0.1 SL-GvDk
logstash-2018.04.04 0     p      INITIALIZING                 127.0.0.1 SL-GvDk
logstash-2018.04.04 0     r      UNASSIGNED
logstash-2018.02.28 0     p      STARTED      5765860   6.8gb 127.0.0.1 SL-GvDk
logstash-2018.02.13 0     p      STARTED      6810856   7.9gb 127.0.0.1 SL-GvDk

The v parameter adds column headers to the output of many of these queries. There is one line per shard in the cluster. INITIALIZING shards are currently spinning up on a node, and move to STARTED once they are ready to query. UNASSIGNED shards are waiting to be assigned to a node. We are only running a single-node cluster, so the replicas will never be assigned: Elasticsearch never places a replica on the same node as its primary, meaning we could never reach a green status. You can force all your existing indices to have no replicas by running

curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_settings' -d '{ "index.number_of_replicas": 0 }'

However, this does not affect indices created in the future. To stop new indices being created with replicas, you need to edit the index template so that they are created with the correct settings.
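For logstash indices the template is usually managed by Logstash itself, but you can also update it directly. A hedged sketch, assuming a template named logstash matching logstash-* indices, and the pre-6.0 template syntax (on 6.0+ the "template" key is "index_patterns"):

```shell
# Sketch only: the template name "logstash" and the index pattern are
# assumptions; list your own templates with
#   curl 'http://localhost:9200/_template?pretty'
# Note that PUT replaces the whole template, so in practice fetch the
# existing template first and re-submit it with the replica count changed.
curl -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_template/logstash' -d '
{
  "template": "logstash-*",
  "settings": { "index.number_of_replicas": 0 }
}'
```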

This cleaned up all the replica shards that would never be assigned, but still didn't solve the problem of the endless restarting. Looking at the logs (stored at /var/log/elasticsearch by default) showed that operations were repeatedly failing with

java.lang.OutOfMemoryError: Java heap space

So I increased the heap size by updating the JVM options, which can be found in /etc/elasticsearch/jvm.options; there are two settings that need to be updated and kept equal to each other:

-Xms2g
-Xmx2g

This sets both the minimum and maximum heap to 2GB (keeping them equal avoids the JVM resizing the heap at runtime), which for us was sufficient to get everything working again after restarting the service:

/etc/init.d/elasticsearch restart
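Once the service is back up, it's worth confirming that the new heap size took effect; a sketch (the _cat/nodes heap columns assume a reasonably recent version):

```shell
# Check the running node's heap (column names per the _cat/nodes API):
#   curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
# You can also sanity-check jvm.options itself: -Xms and -Xmx must agree.
# Here an inline sample stands in for /etc/elasticsearch/jvm.options:
opts='-Xms2g
-Xmx2g'
xms=$(echo "$opts" | sed -n 's/^-Xms//p')
xmx=$(echo "$opts" | sed -n 's/^-Xmx//p')
[ "$xms" = "$xmx" ] && echo "heap min/max agree: $xms"
```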

Now that everything was running again, we could review whether we still needed all those indices; you can list them all by running

curl http://localhost:9200/_cat/indices

and then close any you no longer want by running something like:

curl -X POST 'http://localhost:9200/logstash-2018.04.*/_close'

where wildcards are accepted. However, this won't delete the data on disk; it just frees up the memory and other resources the open indices were using. If you want to permanently delete the data from disk as well, you can run:

curl -X DELETE 'http://localhost:9200/logstash-2018.04.*'
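When deciding which indices to close or delete, sorting the listing by on-disk size makes the biggest offenders obvious. A sketch, assuming a version recent enough (5.1+) for the _cat APIs to support the s sort parameter:

```shell
curl 'http://localhost:9200/_cat/indices?v&s=store.size:desc'
```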