In depth guide to running Graylog in production

In my last post, In depth guide to running Elasticsearch in production, I talked about common pitfalls and things you ought to know when it comes to running Elasticsearch. My main experience with ES is as a backend for Graylog, so in this post I will try to write down everything I know about that. Like before, I do not make any guarantees of completeness. I will also not talk about how to install Graylog; this is covered well in the official documentation and other places.

The basics: Inputs, Extractors, Streams, Pipelines, Index Sets, Buffers and the Journal

Before I begin writing down my experiences with Graylog, I want to go through the basic architecture and nomenclature.

Inputs are listening services that you can send logs to. Graylog supports many different input formats such as Syslog, Beats, GELF, etc. From a user perspective, every input opens a port on your Graylog server that you can then send logs of this particular format to.
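
As a quick sanity check, you can send a test message to a running input. The sketch below assumes a syslog UDP input listening on port 1514 and the util-linux version of logger; hostname and port are placeholders for your own setup:

# send one test message via UDP to a syslog input (host and port are placeholders)
logger -d -n graylog.example.com -P 1514 "hello from $(hostname)"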

Extractors are logically part of an input. Architecturally they are part of the ProcessBufferProcessor. Every extractor of an input runs on every message received on that input. There are ways to limit this, which I will get to later. The goal of extractors is to get around a common issue of log formats such as Syslog. Syslog messages have some pre-defined fields such as facility, severity, date, etc., but that does not help you if you want to show all HTTP requests which took longer than 1000 ms, because ‘time to render’ is not a field defined in the syslog RFC. Graylog’s solution to this problem is to let you extract information like this into custom fields, which are then stored as fields in Elasticsearch and allow you to do searches like

source:web-prod* AND time_to_render:>1000

Streams allow you to route messages into a category in real time based on rules you can set. Streams can be used for many applications, the most common of which is to store specific messages in specific index sets. Other cool things you can do with streams include alerts, outputs and processing pipelines.

Pipelines attach to one or multiple streams and run custom code on every message in that stream. Common tasks are to normalize the names of fields or filter out specific messages.

Index Sets are sets of Elasticsearch indices that Graylog uses to store its data in. They are sets because they are composed of one or more individual ES indices. The indices all share a common prefix and then have a number that simply counts up. If you store a lot of data, this mechanism is necessary because it limits the amount of data stored in a single index. The more data you store in an ES index, the slower searching will be. Furthermore, you can also use index sets to limit the total amount of data stored.

Buffers are an important part of Graylog. As it so happens, the amount of logs that arrives at a given time can fluctuate wildly, while the processing capacity of Graylog stays constant. So there must be some way to buffer messages at various parts of the Graylog architecture. There are three main buffers: Input, Process and Output buffer. You can check the state of buffers under system -> nodes, under a specific node. These buffers are implemented as ring buffers using LMAX Disruptor, an open-source library written by a high-speed trading company.

Journals are implemented using Apache Kafka. Like buffers, they temporarily store messages. Unlike buffers, they do so on disk and can in fact store many millions of messages. Their two purposes are to keep messages safe in case the server crashes, and to store them temporarily if an output, usually Elasticsearch, does not accept messages fast enough.
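
For reference, the journal is controlled by a few settings in server.conf. A minimal sketch along the lines of the defaults; the path and size may differ in your installation:

message_journal_enabled  = true
message_journal_dir      = /var/lib/graylog-server/journal
message_journal_max_size = 5gb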

A more in-depth look at Graylog

Unfortunately the Graylog documentation is somewhat opaque on architectural details, and the best insight into Graylog’s architecture today is a whiteboard sketch you can find here. I cannot code Java and did not dig into Graylog’s source code, but here is what I gathered about how Graylog actually works under the hood.

As mentioned above, Graylog buffers messages in ring buffers. If you don’t know what a ring buffer is, look at this array:

my_array = [ 1, 2, 3, 4 ]

Now imagine you connect the end of the array, my_array[n], back to the first element of the array, my_array[0], using pointers. This way, if you counted up the array and arrived at n, you would actually read the first entry again. Ring buffers have some cool properties, but in practical terms this means they have a limited capacity, and once that is full, the oldest entries get overwritten by new ones.

Graylog places a ring buffer in front of every important step in its architecture. We have the input, processing, and output buffer. All of these have limited size and all of these, ideally, should fit into the L3 cache of your CPU. The size of the ring buffer also has to be a power of two. The default here is 65536.

Processors in Graylog are the processes that do the actual work. Ideally, they each want to have a core for themselves, so it does not make sense to raise the total number of processors higher than the number of cores in your system.

There are three kinds of processors: InputBufferProcessor, ProcessBufferProcessor and OutputBufferProcessor. Each of these processors has the aforementioned ring buffer in front of it.

The InputBufferProcessor is not described anywhere. I will hazard a guess that it handles the initial conversion from a raw log message into the internal Graylog message format (‘message’).

The ProcessBufferProcessor does most of the heavy lifting including extractors and stream routing. This will usually be your bottleneck and the part you should tweak.

The OutputBufferProcessor handles the writing of messages to outputs, which will normally be Elasticsearch. Increasing the number of OutputBufferProcessors will therefore not necessarily increase throughput; you will just get more parallel connections to Elasticsearch, which can actually be a bad thing if ES cannot handle them.

So how do I run Graylog in production?

Architectural considerations

If you are reading this blog, you are probably at a size where you want to deploy a multi-node setup of Graylog. If you do, there are a couple of things to keep in mind. This will not touch on Elasticsearch, as I talked about that in detail in my previous post.

MongoDB

This NoSQL database is what Graylog uses to store all of its settings. MongoDB, like Elasticsearch, uses a quorum-based algorithm to decide which server is the primary. This means you want 2n+1 MongoDB servers. Since Graylog settings do not change very often, it is safe to say that 3 MongoDB nodes should be sufficient.

Make sure to use a MongoDB replica set, enable authentication and encryption, and only allow connections from other MongoDB nodes as well as Graylog.
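
In server.conf this comes down to pointing mongodb_uri at the whole replica set. A hedged sketch, with hostnames, credentials and the replica set name as placeholders:

mongodb_uri = mongodb://graylog:secret@mongodb01:27017,mongodb02:27017,mongodb03:27017/graylog?replicaSet=rs01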

Master

One of your Graylog servers is the master. This is a setting in the config files, and does not happen automatically. The cluster can survive without a master node for a while, but not indefinitely. I have not found documentation about what exactly the master node actually does.
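
The relevant server.conf setting is is_master; set it to true on exactly one node and false everywhere else:

# on the designated master node
is_master = true
# on all other nodes
is_master = false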

Load balancing

You want to load balance web and API requests to the servers, but more importantly, you want to balance logging inputs. This is currently only really possible using nginx, since HAProxy does not yet support UDP load balancing.
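
A minimal sketch of UDP load balancing with the nginx stream module (it has to be compiled in), assuming two Graylog nodes with a syslog UDP input on port 1514; hostnames and ports are placeholders:

stream {
    upstream graylog_syslog_udp {
        server graylog1.example.com:1514;
        server graylog2.example.com:1514;
    }
    server {
        listen 514 udp;
        proxy_pass graylog_syslog_udp;
        # syslog over UDP never answers, so don't wait for responses
        proxy_responses 0;
    }
}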

Resources and tuning

Honestly, there is not much to say here. You need around 8-12 GB of RAM, up to 10 GB of which should be heap for Graylog. Too much heap is not a good thing either, since it increases garbage collection time.

You want a number of cores that depends on your workload, with a minimum of four: one for the OS, and one for each processor:

processor_wait_strategy  = blocking
inputbuffer_processors   = 1
processbuffer_processors = 1
outputbuffer_processors  = 1

If you see a buffer fill up, you should give your node one more core and bump the processor in question up by one. Usually, this will be the ProcessBufferProcessor.

If your output buffer fills up, you should first check if Elasticsearch might be the bottleneck, since increasing the number of OutputBufferProcessors increases the number of simultaneous connections made to ES.

Theoretically you can increase the size of your ring buffers and the Graylog documentation helpfully states

Raise this if raising outputbuffer_processors does not help anymore.

https://docs.graylog.org/en/3.2/pages/configuration/server.conf.html

However, I question the validity of this. If your buffers fill up on a regular basis, it means your processors cannot do the work fast enough. The only solution is to use more processors or reduce the amount of work. Increasing the ring buffer size only helps if you often have very bursty traffic.
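
For completeness, these are the server.conf settings in question; both have to be powers of two (65536 is the default mentioned above):

ring_size             = 65536
inputbuffer_ring_size = 65536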

Inputs

There are four main issues you have to be aware of when using Inputs in production: network protocols, logging protocols, reverse DNS and encryption.

Network protocols

First of all, pretty much all inputs support running on either TCP or UDP. GELF also supports HTTP. As always, you have to be aware of the pros and cons here. UDP is faster, since it does not require a handshake, but it does not guarantee that your message arrives, and it does not support encryption either. Finally, HAProxy cannot load balance UDP currently (it might once HTTP/3 is released).

However, a quirk of the Graylog architecture is that a message can arrive successfully but still not get stored. This is because a 200 response from, say, a GELF HTTP input just means that the input has received the message. It does not mean that the message has been processed and stored. An error in one of those later steps will still mean that the message is not available in Graylog.

As such, I will not be able to tell you what to do here. If the only reason you want to use TCP is the transmission guarantee, then in my mind that is not a good enough reason, since a message arriving does not mean it is processed correctly by Graylog. If your organization requires encryption for log messages, which is definitely a good idea, you will have to use TCP. Personally, I have not run into performance issues from using TCP over UDP. Also keep in mind that you want to be able to load balance logging inputs.

Logging protocols

I am just going to come out and say it: GELF is better than syslog. The reason is that GELF is just JSON and as such supports arbitrary fields. This means you should not need extractors on a properly used GELF input, which speeds up message processing significantly. As we will see, extractors can become a major bottleneck and pain point in larger Graylog installations, so this is a huge plus.
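
To illustrate, a GELF payload is plain JSON; every custom field is prefixed with an underscore and ends up as a regular field in Graylog, no extractor needed. The field names below are made up:

{
  "version": "1.1",
  "host": "web-prod-01",
  "short_message": "GET /checkout took 1450 ms",
  "level": 6,
  "_http_method": "GET",
  "_time_to_render": 1450
}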

Reverse DNS

By default, syslog sends the IP of the machine as the source. However, in Graylog, you get the hostname. The reason is that Graylog does a reverse DNS (rDNS) lookup against that IP to get the name of the machine, for every single message that arrives. This can quickly result in tens of thousands of additional DNS queries per second.

My recommendation is this: configure syslog to send the hostname of the machine where possible. This depends on your logging daemon; in rsyslog you want something like

$PreserveFQDN on

and make sure that the hosts file has the correct entry for 127.0.0.1.

If you have clients that just cannot do this, install a DNS cache on the Graylog server, like dnsmasq. Even a short TTL like 60 or 90 seconds will significantly reduce the number of DNS queries generated.
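
A minimal dnsmasq sketch; the values are illustrative, and you also need to point /etc/resolv.conf (or your local stub resolver) at 127.0.0.1:

# /etc/dnsmasq.conf
listen-address=127.0.0.1
cache-size=10000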

Encryption

Many logging protocols support encryption / authentication via TLS when you run them over TCP. You definitely should enable this if possible. It is very straightforward if you have ever configured certificates. Just keep a couple of things in mind:

  1. The certificate has to be X.509
  2. The private key has to be PKCS#8 (see the conversion example below)
  3. If you use multiple Graylog servers, make sure that the FQDN of the load balancer endpoint is a subject alternative name of the certificate.
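
If your key is not in PKCS#8 format yet, converting it with openssl is a one-liner (filenames are placeholders):

openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in graylog.key -out graylog-pkcs8.key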

Extractors

Extractors are amazing but can also become one of your biggest bottlenecks. Luckily, they can be optimized. To reiterate, here is what happens to every message that arrives at an input that has extractors:

  1. for all extractors, check if the run condition is met
  2. for all matching extractors, run the extractor

Since extractors are usually written in regex (Grok just being a fancy version of regex), it is very easy to write bad ones. Also, since most people don’t actually understand the impact of extractors, you often get cases where the run condition is the same regex as the extractor itself. Since the run condition is always evaluated against all incoming messages, this can drastically reduce throughput of your logging system.

Imagine you have ten extractors taking 500 µs each just for their condition. In this theoretical case a single processor could handle

1s / (500 µs * 10) = 200 msg / s

This is a very extreme example that usually does not happen. In fact, the 95th percentile condition time for an extractor is usually in the single-digit µs range. But it is easy to see how things like this can add up.

In case you didn’t know, you can get metrics per extractor either via the API or under system -> inputs -> extractors -> details.

As a rule of thumb, you should work on every extractor that meets one of these conditions:

  1. the regex does not contain ^ or $
  2. the condition is ‘if field matches regular expression’ and that expression is just a copy of the extractor regex
  3. the condition is a regex not containing ^ or $
  4. high runtime

Fixing bad regexes

I know, learning regex sounds scary, but there are some very simple things you can do to speed them up. The best one is to use ^ and/or $. ^ stands for ‘beginning of string’ and $ for ‘end of string’. If you do not use one of these, you are doing a full-text search. Imagine you have a log message like this:

my-awesome-application hello world message: 22

If, in this case, you wanted to extract the 22 into a field, it might be tempting to do:

\d+

But this will do a full text search through the entire message. If you instead did:

\d+$

It will search from the end of the message and immediately discard all messages that do not have a digit at the end, which is a much faster process.

Fixing bad conditions

First of all, if your condition is the same as the regex, your condition is officially useless and you would be better off just removing it. If your main regex is fast, a condition is unlikely to improve the overall run time of your extractor. As such, you should really only use conditions if the extractor would otherwise produce wrong data, or if your extractor is very complex and the condition very simple. So if you had these two messages

my-awesome-application hello world message: 22
my-second-favorite-application hello world message: 44

and you wanted the 22 and 44 to go into different fields, you would need to use a condition like this:

^my-second-favorite-application

The second case, where your extractor is very complex but you are able to write a simple and fast condition for it, is much rarer. I have really only seen it with grok extractors where all messages have a common prefix, as seen in the code box above. In those cases, filtering by the prefix will indeed help with run time.

GROK extractors

You often have the case that you want to extract multiple fields from a single message. Many people build multiple extractors in this case, often with very generic regexes, resulting in false positives and slow overall performance. In these cases, it is usually better to invest the time once to build one good extractor using grok, which allows you to populate many fields with a single extractor.

Basically, grok is just a macro-language for regex. Instead of writing the regex from scratch, you define patterns that represent a regex. See this example from logstash:

MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
HOST %{HOSTNAME}
IPORHOST (?:%{HOSTNAME}|%{IP})
HOSTPORT %{IPORHOST}:%{POSINT}

You can, and should, install grok patterns in Graylog. You can do so under System -> Grok Patterns. You can then either define your own patterns, or import a pattern file, like this one from elastic.

Once you have done that you can use grok extractors in Graylog. As I said before, the great thing about them is that you can get all fields from a message using one extractor. This helps a lot with run time, but it also makes management much easier. For example, this parses iptables messages:

^%{DMESG_UPTIME}\s%{PROG:application_name}:\sIN=%{WORD:Input_int} OUT=%{DATA:Output_int} (?:MAC)=%{COMMONMAC:Sender_MAC}\:%{COMMONMAC:Receiver_MAC}\:[A-f0-9]{2}\:[A-f0-9]{2}\s+SRC=%{IP:SRCHost} DST=%{IP:DSTHost} %{GREEDYDATA:UNWANTED} PROTO=%{WORD:Protocol} SPT=%{NUMBER:SourcePort} DPT=%{NUMBER:DestinationPort}

Explaining grok is not the aim of this blog post; the really short version is this:

%{PATTERNAME:field_name}\s+ some regex here. I can also throw away stuff%{WORD:UNWANTED}

Index Sets

Choosing the right size

I would like to refer you to my previous post, In depth guide to running Elasticsearch in production. The size of an index depends on the amount of storage you have as well as the maximum amount of shards you can put on those nodes, which is explained in that article.

In most cases, you should rotate indices by size. This is the only way you can make sure you have enough space to store logs. If you rotate by time, you simply do not know how many GB of data will accumulate in that time frame.

The only exception I have seen is data that has to be legally retained for longer, or that is small. One could, for example, choose to store logs of commands entered on servers for multiple years. This data is usually very small, so storing it for longer is not an issue.

Replication and shards

In general, and I know I go against other popular blogs here, always set index replicas to at least 1. Yes, even if you don’t care about the data. The reason is that Graylog will simply not write to a cluster in cluster state red, and cluster state is red when data loss has occurred. So if you have an index set without replication, and even one node that holds a shard of it dies, Graylog will stop writing to that index set altogether and manual intervention is required. Don’t let that happen.

For shards, the short answer is that the number of shards depends on how much data you will write. Logging is a write-heavy workload, and every Elasticsearch node has a limited write capacity. If you only have a single primary shard, you only get the write capacity of one node. But if you have n primary shards, you can get the write capacity of n nodes. That means that for maximum performance, the number of primary shards should equal

number_of_nodes / (index_replicas + 1)

If you set index replicas to 1, every primary shard will have one replica, so you are writing twice the amount of data.
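
As a worked example, with six Elasticsearch nodes and index replicas set to 1:

6 / (1 + 1) = 3 primary shards, each with one replica, for 6 shards spread across 6 nodes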

Indexer failures

An important point that I think many people ignore is indexer failures. You can check those under System -> Overview -> Indexer Failures.

An indexer failure means that your message has not been stored in Elasticsearch. It can have gone through your inputs, extractors, streams, pipelines, etc., just to end up not being written. As such, you have to keep a close eye on indexer failures and take them seriously.

One common failure is that the maximum number of fields of the ES index has been reached. By default, an ES index can have 1000 fields. If you try to PUT a document with new fields into an index that has already reached the maximum number of fields, you will get this error. Yes, you can increase the maximum number of fields, but you really shouldn’t. Every field in ES takes up space, even when it is empty, although this behavior has gotten better since ES 6.
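
For reference, the limit in question is Elasticsearch’s index.mapping.total_fields.limit setting, which defaults to 1000. If you do decide to raise it against the advice above, it would look roughly like this (host and index name are placeholders):

curl -X PUT "http://es01:9200/graylog_42/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.mapping.total_fields.limit": 2000 }'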

If you run into this problem, probably one of two things is happening:

  1. you store everything in one index set
  2. you have duplicate field names

If you run Graylog at a scale that requires you to read this blog post, you most likely want to have multiple index sets. How you divide these is up to you. You could divide them per application or per department. We have, for example, separate index sets for Debug logs, Windows, Firewall, Routing, CLI commands entered, Access logging (web server etc.), big applications, and, most importantly, a dedicated ‘naughty’ index set.

Every application that does not understand Graylog and just generates new fields for every log message gets sent to the naughty index set, together with all the other misbehaving applications. This way they only destroy logging for each other, and let the well-behaved applications log in peace.

Once you have multiple index sets, you can control what data is sent where by using streams. Every stream allows you to set the index set to which its data is written.

Another problem that happens often is that organizations have duplicate field names:

department
dept
department_name
render_time
rendertime
render_time_millis
render_time_seconds

This can very quickly use up a lot of fields. It is a management issue, because there should be standards on how to name these things. There are two technical solutions for it. One, as already mentioned, is to send different applications to different index sets. The other is to use processing pipelines to rename fields to a common standard.

Pipelines

Pipelines allow you to run code on your messages. Basically, a pipeline gets messages from one or many streams and then sends them through a series of rules. Every rule is a piece of code you write yourself. This can be pretty simple stuff or really complicated. Not much to say here, except to keep in mind that this runs on every message in a stream, so it will slow throughput down. Doing things like renaming fields in Graylog is a crutch; the better way is to fix the names in the application itself. In case that isn’t possible, you could for example use a rule like this to change field names around:

rule "standardize-juniper-fields"
when
  has_field("application_name") && (to_string($message.application_name) == "RT_FLOW" || to_string($message.application_name) == "RT_IDS")
then
  set_field("application_name", "firewall");
  rename_field("destination-address", "DSTHost");
  rename_field("source-address", "SRCHost");
  rename_field("destination-port", "DestinationPort");
  rename_field("source-port", "SourcePort");
  rename_field("source-address_city_name", "SRCHost_city_name");
  rename_field("source-address_country_code", "SRCHost_country_code");
  rename_field("source-address_geolocation", "SRCHost_geolocation");
end

Monitoring

Luckily, Graylog is built entirely around a REST API, which makes it very easy to monitor. You can see all available metrics under System -> Nodes -> Metrics. Here are a couple of metrics you should keep an eye on:

system/metrics/org.graylog2.journal.utilization-ratio
# if this goes over a couple percent, you have some
# sort of issue with writing log messages to ES
# that you have to look into.
system/indexer/failures
# every indexer failure means a message not written
# as such, every indexer failure is an error that
# needs to be fixed
system/throughput
# this shows messages/sec. a good historical metric
# to have
system/metrics/org.graylog2.shared.buffers.ProcessBufferProcessor.processTime
# the mean as well as the percentiles of this are
# useful. If you see a sudden spike, someone added
# an extractor or a pipeline that is breaking
# Graylog. Also good to see whether your extractor
# tuning has an effect
system/metrics/org.apache.logging.log4j.core.Appender.warn
system/metrics/org.apache.logging.log4j.core.Appender.error
system/metrics/org.apache.logging.log4j.core.Appender.fatal
# rate of graylog-server log messages that are
# WARN, ERROR or FATAL. This is good to see if
# something is wrong internally
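
All of these can also be queried via the REST API, for example like this (host, port and credentials are placeholders, and the exact path may differ between Graylog versions):

curl -s -u admin:password \
  "http://graylog.example.com:9000/api/system/metrics/org.graylog2.journal.utilization-ratio"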

On top of that, as with every Java program, use jstat to check heap and garbage collection data.
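
A hedged one-liner for that, assuming a single graylog-server JVM on the machine:

# print GC utilization every 5 seconds for the graylog-server JVM
jstat -gcutil "$(pgrep -f graylog-server | head -n 1)" 5000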

Help! Graylog does not write to Elasticsearch anymore

Go to System -> Nodes. Click on a node. Observe. Are the buffers full or is the journal filling up?

If the buffers are full, you need more ProcessBufferProcessors. If the journal is filling up, you have an issue writing to Elasticsearch. In that case, check the output counter in the top right corner. Is it just a low number, or does it constantly switch between 0 and whatever bulk size you set in Graylog? In the latter case your ES cluster is most likely not fast enough to handle writing the log messages. There is a back-pressure system built into ES, and Graylog will stop writing when ES is too busy.

You can also try to rotate the active write index. Sometimes Graylog does get stuck and is unable to write; rotating the active write index fixes that.

If these steps do not help, restart graylog-server. We had a recurring bug which caused an entire Graylog node to stop processing messages. Our workaround was to automatically restart graylog-server whenever the journal utilization went over 3%.
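
A rough sketch of that workaround, run from cron on each node. The API path, the credentials and the assumption that the utilization ratio is reported as a fraction between 0 and 1 are all mine, so verify against your own installation before using anything like this:

#!/bin/bash
# restart graylog-server if journal utilization exceeds roughly 3 %
RATIO=$(curl -s -u admin:password \
  "http://localhost:9000/api/system/metrics/org.graylog2.journal.utilization-ratio" \
  | jq -r '.value // 0')

# compare as floating point; 0.03 corresponds to 3 %
if [ "$(echo "$RATIO > 0.03" | bc -l)" -eq 1 ]; then
  systemctl restart graylog-server
fi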

If that still does not help, you might have a corrupted Kafka journal. In this case you will have to delete it, losing all messages within it:

service graylog-server stop
rm -rf /var/lib/graylog-server/journal
service graylog-server start

One thought on “In depth guide to running Graylog in production”

  1. I’ve experimented with processbuffer_processors along with the other two related settings. Adjusting these so that their sum = the number of processor cores left me with underutilized cores. The VM was literally incapable of driving all cores anywhere near 100%. What made the best use of cores was to leave the input and outputbuffer at 2 and adjust the processbuffer to perhaps 1.2 to 1.4x the number of cores in the system.

    As a VMware guy, I dislike high core counts due to CPU Ready, co-stop, NUMA locality with “wide” VMs, etc. I’d rather keep the core count low and drive them hard.
