In-depth guide to running Elasticsearch in production

// EDIT, 6th October 2020: I am moving my blog and this article to Medium.

If you are here, I do not need to tell you that Elasticsearch is awesome, fast and mostly just works.
If you are here, I also do not need to tell you that Elasticsearch can be opaque, confusing, and seems to break randomly for no reason. In this post I want to share my experiences and tips on how to set up Elasticsearch correctly and avoid common pitfalls.
I am not here to make money so I will mostly just jam everything into one post instead of doing a series. Feel free to skip sections.

The basics: Clusters, Nodes, Indices and Shards

If you are really new to Elasticsearch (ES) I want to explain some basic concepts first. This section will not explain best practices at all, and focuses mainly on explaining the nomenclature. Most people can probably skip this.

Elasticsearch is a management framework for running distributed installations of Apache Lucene, a Java-based search engine. Lucene is what actually holds the data and does all the indexing and searching. ES sits on top of this and allows you to run potentially many thousands of Lucene instances in parallel.

The highest level unit of ES is the cluster. A cluster is a collection of ES nodes and indices.

Nodes are instances of ES. These can be individual servers or just ES processes running on a server. Servers and nodes are not the same: a VM or physical server can hold many ES processes, each of which will be a node. A node can join exactly one cluster. There are different types of node, the two most interesting of which are the data node and the master-eligible node. A single node can be of multiple types at the same time. Data nodes run all data operations, that is storing, indexing and searching of data. Master-eligible nodes vote for a master that runs the cluster and index management.

Indices are the high-level abstraction of your data. Indices do not hold data themselves; they are just another abstraction over the things that actually hold data. Any action you run on data, such as inserts, deletes, indexing and searching, runs against an index. An index belongs to exactly one cluster and is comprised of shards.

Shards are instances of Apache Lucene. A shard can hold many documents. Shards are what do the actual data storage, indexing and searching. A shard belongs to exactly one node and exactly one index. There are two types of shards: primary and replica. These are mostly identical: they hold the same data, and searches run against all shards in parallel. Of all the shards that hold the same data, one is the primary shard. This is the only shard that can accept indexing requests. Should the node that the primary shard resides on die, a replica shard will take over and become the primary. Then, ES will create a new replica shard and copy the data over.

At the end of the day, we end up with something like this:

A more in-depth look at Elasticsearch

If you want to run a system, it is my belief that you need to understand that system. In this section I will explain the parts of Elasticsearch I believe you should understand if you want to manage it in production. There are no recommendations in this section; those come later. Instead it aims purely at explaining the necessary background.

Quorum

It is very important to understand that Elasticsearch is a (flawed) democracy. Nodes vote on who should lead them, the master. The master runs a lot of cluster-management processes and has the last say in many matters. ES is a flawed democracy because only a subclass of citizens, the master-eligible nodes, are allowed to vote. Master-eligible is every node that has this in its configuration:

node.master: true

On cluster start, or when the master leaves the cluster, all master-eligible nodes start an election for the new master. For this to work, you need to have 2n+1 master-eligible nodes. Otherwise it is possible to have two candidate nodes each receiving 50% of the votes, a split-brain scenario, which will lead to the loss of all data in one of the two partitions. So don't let this happen: you need 2n+1 master-eligible nodes.
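For illustration, here is a minimal sketch of the settings involved, assuming three dedicated master-eligible nodes named master-1 to master-3 (the names are made up). In 7.x quorum handling is automatic and the list below is only consulted when the cluster forms for the very first time; on 6.x and earlier you have to set the quorum size yourself:

# 7.x: only used at the very first cluster bootstrap
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]

# 6.x and earlier: quorum = (number of master-eligible nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2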

How nodes join the cluster

When an ES process starts, it is alone in the big, wide world. How does it know what cluster it belongs to? There are different ways this can be done. However, these days the way it should be done is using what is called Seed Hosts.

Basically, Elasticsearch nodes talk with each other constantly about all the other nodes they have seen. Because of this, a node only needs to know a couple other nodes initially to learn about the whole cluster.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

This isn’t really a constant process: nodes only share information about other nodes they have discovered when they’re not part of a cluster. Once they’ve joined a cluster they stop this and rely on the cluster’s elected master to share any changes as they occur, which saves a bunch of unnecessary network chatter. Also in 7.x they only really talk about the master-eligible nodes they have seen — the discovery process ignores all the master-ineligible nodes.

davecturner

Let's look at this example of a three-node cluster:

Initial state.

In the beginning, nodes A and C just know B. B is the seed host. Seed hosts are either given to ES in the form of a config file or put directly into elasticsearch.yml.


Node A connects and exchanges information with B

As soon as node A connects to B, B now knows of the existence of A. For A, nothing changes.

Node C connects and shares information with B

Now, C connects. As soon as this happens, B tells C about the existence of A. C and B now know all nodes in the cluster. As soon as A connects to B again, it will also learn of the existence of C.
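In this example, B would be listed as a seed host in the configuration of A and C. A minimal elasticsearch.yml sketch, assuming 7.x and made-up host names (alternatively, discovery.seed_providers can point at a file containing the same list):

cluster.name: my-cluster
discovery.seed_hosts: ["node-b.example.com:9300"]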

Segments and segment merging

Above I said that shards store data. This is only partially true. At the end of the day, your data is stored on a file system in the form of... files. In Lucene, and with that also Elasticsearch, these files are called segments. A shard will have anywhere between one and several thousand segments.

Again, a segment is an actual, real file you can look at in the data directory of your Elasticsearch installation. This means that every segment comes with overhead: if you want to look into one, you have to find it and open it. So if you have to open many files, there will be a lot of overhead. The problem is that segments in Lucene are immutable. That is fancy language for saying they are only written once and cannot be changed. This in turn means that every document you put into ES will create a segment with only that single document in it. So clearly, a cluster that has a billion documents has a billion segments, which means there are a literal billion files on the file system, right? Well, no.

In the background, Lucene does constant segment merging. It cannot change segments, but it can create new ones with the data of two smaller segments.

This way, Lucene constantly tries to keep the number of segments, which means the number of files, which means the overhead, small. It is possible to force this process by using a force merge.
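You can watch this happen on a live cluster: the _cat/segments API lists every segment of every shard copy of an index. A small sketch using Python and the requests library; the host and index name are placeholders:

import requests

ES = "http://localhost:9200"    # placeholder address

# one JSON entry per segment, per shard copy, of the given index
segments = requests.get(f"{ES}/_cat/segments/my-index",
                        params={"format": "json"}).json()
print(f"{len(segments)} segments across all shard copies of my-index")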

Message routing

In Elasticsearch, you can run any command against any node in a cluster and the result will be the same. That is interesting because, at the end of the day, a document lives in only one primary shard and its replicas, and there is no central mapping telling ES which documents live in which shard.

If you are searching, then the ES node that gets the request will broadcast it to all shards in the index. This means primary and replica. These shards then look into all their segments for that document.

If you are inserting, then the ES node that receives the request will determine the target primary shard from the document's routing value (by default its ID) and send the document there. It is then written to that primary shard and all of its replicas.
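For the curious, the routing formula behind this roughly boils down to

shard_num = hash(_routing) % number_of_primary_shards

where _routing defaults to the document ID. This is also why the number of primary shards of an index cannot simply be changed after the fact.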

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

The word “shard” is ambiguous here and although I think you understand this correctly it’d be easy for a reader to misinterpret how this is written. Each search only fans out to one copy of each shard in the index (either a primary or a replica, but not both). If that fails then the search will of course try again on another copy, but that’s normally very rare

davecturner

So how do I run Elasticsearch in production?

Finally, the practical part. I should mention that I managed ES mostly for logging. I will try to keep this bias out of this section, but will ultimately fail.

Sizing

The first question you need to ask, and subsequently answer yourself, is about sizing. What size of ES cluster do you actually need?

RAM

I am talking about RAM first, because your RAM will limit all other resources.

Heap

ES is written in Java. Java uses a heap. You can think of this as Java-reserved memory. There are all kinds of things that are important about heap that would triple this document in size, so I will get down to the most important part, which is heap size.

Use as much as possible, but no more than 30G of heap size.

Here is a dirty secret many people don't know about heap: every object in the heap needs a unique address, an object pointer. This address is of fixed length, which limits the amount of memory you can address. The short version of why this matters is that at a certain point, Java can no longer use compressed object pointers and has to fall back to uncompressed, 64-bit ones. That means every object reference takes twice the memory, the effective heap space shrinks and caches work less well, so everything gets slower. You 100% do not want to get over this threshold, which is somewhere around 32G.

I once spent an entire week locked in a dark room doing nothing but using esrally to benchmark different combinations of file systems, heap sizes and BIOS settings for Elasticsearch. Long story short, here is what it had to say about heap size:

Index append latency, lower is better

The naming convention is fs_heapsize_biosflags. As you can see, starting at 32G of heap size, performance suddenly starts getting worse. The same goes for throughput:

Index append median throughput. Higher is better.

Long story short: use 29G of heap, or 30 if you are feeling lucky; use XFS; and enable the hardware prefetcher and LLC prefetch in the BIOS if possible.
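For reference, heap size is set in config/jvm.options (or via the ES_JAVA_OPTS environment variable). A minimal sketch matching the recommendation above; at startup, Elasticsearch logs whether compressed ordinary object pointers are in use, which is worth double-checking:

# always set min and max heap to the same value
-Xms29g
-Xmx29g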

FS cache

Most people run Elasticsearch on Linux, and Linux uses RAM as file system cache. A common recommendation is to use 64G for your ES servers, with the idea that it will be half cache, half heap. I have not tested FS cache. However, it is not hard to see that large ES clusters, like for logging, can benefit greatly from having a big FS cache. If all your indices fit in heap, not so much.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

Elasticsearch 7.x uses a reasonable amount of direct memory on top of its heap and there are other overheads too which is why the recommendation is a heap size no more than 50% of your physical RAM. This is an upper bound rather than a target: a 32GB heap on a 64GB host may not leave very much space for the filesystem cache. Filesystem cache is pretty key to Elasticsearch/Lucene performance, and smaller heaps can sometimes yield better performance (they leave more space for the filesystem cache and can be cheaper to GC too).

davecturner

CPU

This depends on what you are doing with your cluster. If you do a lot of indexing, you need more and faster CPUs than if you just do logging. For logging, I found 8 cores to be more than sufficient, but you will find people out there using way more since their use case can benefit from it.

Disk

Not as straightforward as you might think. First of all, if your indices fit into RAM, your disk only matters when the node is cold. Secondly, the amount of data you can actually store depends on your index layout. Every shard is a Lucene instance, and they all have memory requirements. That means there is a maximum number of shards you can fit into your heap. I will talk more about this in the index layout section.

Generally, you can put all your data disks into a RAID 0. You should replicate on Elasticsearch level, so losing a node should not matter. Do not use LVM with multiple disks as that will write only to one disk at a time, not giving you the benefit of multiple disks at all.

Regarding file system and RAID settings, I have found the following things (a small sketch for checking your current block-device settings follows the list):

  • Scheduler: cfq and deadline outperform noop. Kyber might be good if you have NVMe, but I have not tested it
  • Queue depth: as high as possible
  • Readahead: yes, please
  • RAID chunk size: no impact
  • FS block size: no impact
  • FS type: XFS > ext4
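The sketch below, in Python, reads the relevant sysfs knobs so you can see what your block devices are currently set to before changing anything. Device names will differ, and some virtual devices may not expose all of these files:

import glob

for path in glob.glob("/sys/block/sd*/queue/scheduler"):
    dev = path.split("/")[3]
    with open(path) as f:
        scheduler = f.read().strip()    # the active scheduler is shown in [brackets]
    with open(f"/sys/block/{dev}/queue/read_ahead_kb") as f:
        readahead_kb = f.read().strip()
    with open(f"/sys/block/{dev}/queue/nr_requests") as f:
        queue_depth = f.read().strip()
    print(f"{dev}: scheduler={scheduler} read_ahead_kb={readahead_kb} nr_requests={queue_depth}")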

Index layout

This highly depends on your use case. I can only talk from a logging background, specifically using Graylog.

Shards

Short version:

  • for write heavy workloads, primary shards = number of nodes
  • for read heavy workloads, primary shards * replication = number of nodes
  • more replicas = higher search performance

Here is the thing. If you write stuff, the maximum write performance you can get is given by this equation:

node_throughput*number_of_primary_shards

The reason is very simple: if you have only one primary shard, then you can write data only as quickly as one node can write it, because a shard only ever lives on one node. If you really wanted to optimize write performance, you should make sure that every node only has exactly one shard on it, primary or replica, since replicas obviously get the same writes as the primary, and writes are largely dependent on disk IO. Note: if you have a lot of indexing this might not be true and the bottleneck could be something else.

If you want to optimize search performance, search performance is given by this equation:

node_throughput*(number_of_primary_shards + number_of_replicas)

For searching, primary and replica shards are basically identical. So if you want to increase search performance, you can just increase the number of replicas, which can be done on the fly.
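Because number_of_replicas is a dynamic index setting, scaling search capacity really is a one-line settings update, provided there are enough nodes to place the new replicas on. A sketch with Python and the requests library; host and index name are placeholders:

import requests

# raise the replica count of an existing index, e.g. from 1 to 2
resp = requests.put("http://localhost:9200/my-index/_settings",
                    json={"index": {"number_of_replicas": 2}})
resp.raise_for_status()    # ES acknowledges, then allocates the new replica shards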

Size

Much has been written about index size. Here is what I found:

30G of heap = 140 shards maximum per node

Using more than 140 shards, I had Elasticsearch processes crash with out-of-memory errors. This is because every shard is a Lucene instance, and every instance requires a certain amount of memory. That means there is a limit for how many shards you can have per node.

If you know the number of nodes, the number of primary shards per index and the replication factor, here is how many indices you can fit:

number_of_indices = (140 * number_of_nodes) / (number_of_primary_shards * replication_factor)

From that and your disk size you can easily calculate how big the indices have to be:

index_size = (number_of_nodes * disk_size) / number_of_indices
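To make this concrete, here is a quick calculation with made-up numbers: 10 data nodes, 2 TB of usable disk per node, 4 primary shards per index and a replication factor of 2 (one primary plus one replica per shard), using the 140-shards-per-node limit from above:

nodes = 10
disk_per_node_tb = 2
primary_shards = 4
replication_factor = 2        # primary + 1 replica
max_shards_per_node = 140     # empirical limit at ~30G heap, see above

number_of_indices = (max_shards_per_node * nodes) // (primary_shards * replication_factor)
index_size_tb = (nodes * disk_per_node_tb) / number_of_indices
print(number_of_indices, "indices,", round(index_size_tb * 1000), "GB per index")
# -> 175 indices, 114 GB per index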

However, keep in mind that bigger indices are also slower. For logging it is fine to a degree but for really search heavy applications, you should size more towards the amount of RAM you have.

Segment merging

Remember that every segment is an actual file on the file system. More segments = more overhead in reading. Basically, every search query goes to all the shards in the index, and from there to all the segments in those shards. Having many segments drastically increases the read IOPS of your cluster, up to the point of it becoming unusable. Because of this, it's a good idea to keep the number of segments as low as possible.

There is a force_merge API that allows you to merge segments down to a certain number, like 1. If you do index rotation, for example because you use Elasticsearch for logging, it is a good idea to do regular force merges while the cluster is not in use. Force merging takes a lot of resources and will slow your cluster down significantly. Because of this, it is a good idea not to let Graylog, for example, do it for you, but to do it yourself when the cluster is used less. You definitely want to do this if you have many indices, though. Otherwise, your cluster will slowly crawl to a halt.
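For illustration, here is a sketch of such a job in Python with the requests library: it force merges yesterday's index, which is rotated and no longer written to, down to a single segment. The index naming scheme and host are made up, and force merge calls can block for a long time:

import requests
from datetime import date, timedelta

ES = "http://localhost:9200"
index = "graylog_" + (date.today() - timedelta(days=1)).isoformat()    # hypothetical naming scheme

# only force merge indices that are no longer being written to
resp = requests.post(f"{ES}/{index}/_forcemerge", params={"max_num_segments": 1})
resp.raise_for_status()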

Cluster layout

For everything but the smallest setups it is a good idea to use dedicated master-eligible nodes. The main reason is that you should always have 2n+1 master-eligible nodes to ensure quorum, but for data nodes you just want to be able to add a new one at any time, without having to worry about this requirement. Also, you don't want high load on the data nodes to impact your master nodes.
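In practice, a dedicated master-eligible node is simply a node with everything else switched off. A minimal sketch of the relevant elasticsearch.yml lines (valid for 6.x and 7.x; newer 7.x releases also offer a node.roles setting instead):

node.master: true
node.data: false
node.ingest: false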

Finally, master nodes are ideal candidates for seed nodes. Remember that seed nodes are the easiest way to do node discovery in Elasticsearch. Since your master nodes will seldom change, they are the best choice for this, as they most likely already know all other nodes in the cluster.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

Master-eligible nodes are the only possible candidates for seed nodes in 7.x, because master-ineligible nodes are ignored during discovery.

davecturner

Master nodes can be pretty small, one core and maybe 4G of RAM is enough for most clusters. As always, keep an eye on actual usage and adjust accordingly.

Monitoring

I love monitoring, and I love monitoring Elasticsearch. ES gives you an absolute ton of metrics, all of them as JSON, which makes it very easy to pass them into monitoring tools. Here are some helpful things to monitor (a small polling sketch follows the list):

  • number of segments
  • heap usage
  • heap GC time
  • avg. search, index, merge time
  • IOPS
  • disk utilization
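Almost all of these come from the cluster health and node stats APIs. A minimal polling sketch with Python and the requests library that prints a few of the numbers above; the host is a placeholder, and in a real setup you would ship the values to your monitoring system instead of printing them:

import requests

ES = "http://localhost:9200"    # placeholder address

health = requests.get(f"{ES}/_cluster/health").json()
print("cluster status:", health["status"], "- unassigned shards:", health["unassigned_shards"])

# per-node JVM and index stats, filtered to the metrics we care about
stats = requests.get(f"{ES}/_nodes/stats/jvm,indices").json()
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    segment_count = node["indices"]["segments"]["count"]
    print(node["name"], f"heap={heap_pct}%", f"segments={segment_count}")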

Conclusion

After around 5 hours of writing this, I think I dumped everything important about ES that is in my brain into this post. I hope it saves you many of the headaches I had to endure.

Resources

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-discovery-quorums.html
https://github.com/elastic/rally
https://tech.ebayinc.com/engineering/elasticsearch-performance-tuning-practice-at-ebay/

19 thoughts on “In-depth guide to running Elasticsearch in production”

  1. Elasticsearch is a complete headache to run in production. It is very expensive to maintain and has a giant compute cost. I think everyone who is considering using it in their platform should think twice. After all, full-text search is only a strong requirement for some applications.

  2. Hi,

    Nice article! good overview of the most important concepts. I was not aware of the ~30GB limitation, good to know!

    One question: are you sure that the “master-eligible nodes formula” is 2n+1 and not 2/n+1?

    Keep it up!

    Cheers

  3. Hey Mattis, thanks for the nice article. A 5-hour braindump is always an interesting format as it shows what’s in the front of your mind. Great to hear you’re using tools like `esrally` to get the most out of your cluster. I have a few observations specific to the 7.x versions that I think you’re mainly focussing on here:

    > Otherwise it is possible to have a split-brain scenario, with two nodes receiving 50% of the votes. This is a split brain scenario and will lead to the loss of all data in one of the two partitions. So don’t have this happen. You need 2n+1 master-eligible nodes.

    This is a common misunderstanding but it isn’t true: there is no elevated risk of data loss if you have an even number of master-eligible nodes. The consequence of two master-eligible nodes each receiving half of the votes is that that round of voting fails to elect either node and (in 7.x) usually the two tied nodes race to win the next voting round within a few hundred milliseconds. Although this situation doesn’t put your data at risk, [Elasticsearch 7.x still puts some effort into avoiding this kind of tie](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-voting.html#_even_numbers_of_master_eligible_nodes) to help elections complete more quickly.

    > Basically, Elasticsearch nodes talk with each other constantly about all the other nodes they have seen.

    This isn’t really a constant process: nodes only share information about other nodes they have discovered when they’re not part of a cluster. Once they’ve joined a cluster they stop this and rely on the cluster’s elected master to share any changes as they occur, which saves a bunch of unnecessary network chatter. Also in 7.x they only really talk about the master-eligible nodes they have seen — the discovery process ignores all the master-ineligible nodes.

    > master nodes are ideal candidates for seed nodes

    Master-eligible nodes are the _only_ possible candidates for seed nodes in 7.x, because master-ineligible nodes are ignored during discovery.

    > A common recommendation is to use 64G for your ES servers, with the idea that it will be half cache, half heap.

    Elasticsearch 7.x uses a reasonable amount of direct memory on top of its heap and there are other overheads too which is why [the recommendation is a heap size no more than 50% of your physical RAM](https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html). This is an upper bound rather than a target: a 32GB heap on a 64GB host may not leave very much space for the filesystem cache. Filesystem cache is pretty key to Elasticsearch/Lucene performance, and smaller heaps can sometimes yield better performance (they leave more space for the filesystem cache and can be cheaper to GC too).

    > If you are searching, then the ES node that gets the request will broadcast it to all shards in the index. This means primary and replica.

    The word “shard” is ambiguous here and although I think you understand this correctly it’d be easy for a reader to misinterpret how this is written. Each search only fans out to _one_ copy of each shard in the index (either a primary or a replica, but not both). If that fails then the search will of course try again on another copy, but that’s normally very rare.

    1. Hi Dave,

      thanks very much for your comment, I added that information to the post. And my thanks to the esrally team, once you get that thing running it is super helpful.

      I have not used ES 7 yet, but I have definitely seen an ES 2.x cluster run into a split-brain scenario where we ended up having two clusters with one node each, and we had to kill one node and lose all that data. I am sure this has gotten better since, but I will keep that information in the post unedited.

  4. Thank you for this interesting read. I was a bit surprised by your point about the problems of having an even number of masters. To me it looks like, if that were true, it wouldn't matter whether you have 3 masters or 11, as a failure of one of them would mean having 2 or 10 masters, with the same chance of breaking democracy.
    I would also like to see how interesting it is to install ES on a Kubernetes cluster.
    Thanks!

  5. Easily the best knowledge dump I’ve seen on running Elastic in production. It seems like you too have noticed there are plenty of sources explaining what you can do, but few explaining what you should do.

    Thank you!

  6. Great post, thank you for taking the time to write it.

    Have you looked at managed Elasticsearch solutions? Would you be interested in writing a similar article about your Graylog experience?

  7. Hey Mattis,

    this was a great read. However, I must say that the part about 32GB heap and compressed oops is not quite right. I mean, it is true that there will be a degradation of performance but not due to JVM using compressed oops beyond 32GB but because it will stop using them. Beyond that threshold, JVM switches to 64bit references (uncompressed oops) which increases the sizes of objects (big overhead for gc and CPU cache).

  8. I am not certain about your assumption about the sizing of the nodes for indices. You write:

    number_of_indices = (140 * number_of_nodes) / (number_of_primary_shards * replication_factor)

    which, with 8 nodes, would be around 280 indices; your reasoning is that the nodes were crashing with more.

    I currently have a cluster running with around 900 (!) indices, and just 15 GB (!) of heap per node – it’s running fine and I haven’t had any issues with crashes of ES.

    Might there be also an impact of the index / shard size? My indices are around 3 – 10 GB each for a total data size of around 4.5 TB.

    I am also running ES on Kubernetes: the issues I have encountered so far are mainly about discovery (solved via dedicated master nodes) and split brain (also fixed with dedicated master nodes). I would like to stress that this actually DOES NOT break all your data, as long as both cluster sides are not indexing new data and you catch it before ES starts to promote replicas to primaries (it might not even be an issue then).

    Regardless of that I have found that ES runs best on SSD like storage, especially restart times are drastically improved.

    Restart on a NFS connected node -> 2 – 4 h with around 1400 shards
    Restart on a local ssd node -> 2 – 3 minutes with around 900 shards

    And I would like to add the nice little setting:

    "cluster.routing.allocation.enable": "new_primaries"

    vs

    "cluster.routing.allocation.enable": "all"

    when doing rolling restarts, this prevents any funny business with shard reallocation.

  9. “30G of heap = 140 shards maximum per node”
    I don’t think this paints a complete picture. Shouldn’t this depend on the size of the shards and the types of the queries that are being performed on the cluster as well?

  10. Hi Mattis,

    I am using ES in production for the past few years and was interested to hear some of your recommendations. I do have some questions about some of your insights.

    1. We use very large clusters containing thousands of indices and tens of thousands of shards. We never had issues that directly correlate to the number of shards or indices. Could you please elaborate more on the following calculation and how you’ve reached it?
    “number_of_indices = (140 * number_of_nodes) / (number_of_primary_shards * replication_factor)”

    2. In the Disk section, you talk about filesystem settings and configurations that are better for performance such as the scheduler. Do you have any graphs or extra details regarding the benefits of each configuration? It’s difficult for us to make changes to our systems and I want to understand the impact we can achieve.

    Thank you 🙂
