A queue is a simple system:
- producers place messages into the queue;
- consumers remove messages from the queue;
- whenever the produce rate is higher than the consume rate for an extended period of time, you have queue overload. By overload, I mean the queue either becomes full or keeps growing without bound.
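To make the overload condition concrete, here is a minimal sketch of how the backlog evolves when both rates are constant. The function name `backlog_after` and its parameters are my own illustration, and real rates are rarely constant, so treat it as back-of-the-envelope arithmetic rather than a model of any particular queue.

```python
from typing import Optional


def backlog_after(
    produce_rate: float,
    consume_rate: float,
    seconds: float,
    initial: float = 0.0,
    capacity: Optional[float] = None,
) -> float:
    """Estimate queue depth after `seconds`, assuming constant rates (messages/second)."""
    # The backlog changes by the difference between the two rates every second.
    depth = initial + (produce_rate - consume_rate) * seconds
    # A queue cannot hold a negative number of messages.
    depth = max(0.0, depth)
    # A bounded queue stops growing once it is full.
    if capacity is not None:
        depth = min(depth, capacity)
    return depth


if __name__ == "__main__":
    # 120 msg/s in, 100 msg/s out: the backlog grows by 20 messages every second.
    print(backlog_after(120, 100, seconds=60))
    # The same imbalance against a bounded queue of 1000 messages.
    print(backlog_after(120, 100, seconds=60, capacity=1000))
```

With an unbounded queue the backlog keeps growing; with a bounded one it hits capacity. These are the two forms of overload described above.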
In order to identify overload you need to monitor queue latency. Queue latency tells you how long messages are waiting in the queue. For example, if the typical queue latency for a system is 1 second, and suddenly queue latency is at 60 seconds and growing, then you have queue overload.
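As a rough sketch of such a check, the snippet below computes queue latency from the enqueue timestamp of the oldest waiting message and flags overload when latency is both far above its typical value and still growing. The names `queue_latency` and `is_overloaded`, and the default factor of 10, are assumptions I picked for illustration; tune them to your own system.

```python
from typing import Sequence


def queue_latency(oldest_enqueued_at: float, now: float) -> float:
    """Seconds the oldest message has been waiting in the queue."""
    return max(0.0, now - oldest_enqueued_at)


def is_overloaded(
    latency_samples: Sequence[float],
    typical_latency: float,
    factor: float = 10.0,
) -> bool:
    """Flag overload from recent latency samples, ordered oldest first.

    Assumed heuristic: the newest sample is more than `factor` times the
    typical latency, and every sample is higher than the one before it.
    """
    if len(latency_samples) < 2:
        return False  # not enough data to tell whether latency is growing
    growing = all(
        later > earlier
        for earlier, later in zip(latency_samples, latency_samples[1:])
    )
    return growing and latency_samples[-1] > factor * typical_latency


if __name__ == "__main__":
    # Typical latency is 1 second; recent samples climb to 60 seconds.
    print(is_overloaded([1.0, 20.0, 60.0], typical_latency=1.0))
    # Latency fluctuates around its typical value.
    print(is_overloaded([1.0, 1.2, 0.9], typical_latency=1.0))
```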
To understand the root cause of queue overload I use the following process:
1. Understand when queue latency started to increase. You need to know when the anomaly started so you can correlate it with other metrics.
2. Understand if the problem is on the producer or on the consumer.
   - To understand if the problem is on the producer, you need to assess if an increase in produce rate occurred shortly before the queue latency started to spike.
   - To understand if the problem is on the consumer, you need to assess if a decrease in consume rate occurred shortly before the queue latency started to spike.
3. Now that the issue has been localized (i.e. consumer or producer), dig into additional relevant metrics.
   - For the producer, you need to look into the metrics in upstream systems that lead to message production.
   - For the consumer, you need to look for changes in consumer processing time (and what may affect it) or changes in the number of workers reading from the queue.
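The localization step can be sketched in code. The snippet below assumes you already know when latency started to rise and that you have produce and consume rates as lists of `(timestamp, rate)` samples. It compares the window right before the onset with the window before that. The function names, the 5-minute window, and the 20% threshold are all hypothetical choices of mine, not values from any monitoring tool.

```python
from statistics import mean
from typing import List, Optional, Tuple

Series = List[Tuple[float, float]]  # (timestamp in seconds, rate in msg/s)


def mean_in_window(series: Series, start: float, end: float) -> Optional[float]:
    """Average rate for samples with start <= timestamp < end, or None if empty."""
    values = [rate for ts, rate in series if start <= ts < end]
    return mean(values) if values else None


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after`, guarding against division by zero."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return (after - before) / before


def localize(
    produce: Series,
    consume: Series,
    onset: float,
    window: float = 300.0,
    threshold: float = 0.2,
) -> str:
    """Return 'producer', 'consumer', 'both', or 'unclear'."""
    baseline = (onset - 2 * window, onset - window)
    recent = (onset - window, onset)
    p_before = mean_in_window(produce, *baseline)
    p_after = mean_in_window(produce, *recent)
    c_before = mean_in_window(consume, *baseline)
    c_after = mean_in_window(consume, *recent)
    if None in (p_before, p_after, c_before, c_after):
        return "unclear"  # missing data in one of the windows
    # Producer-side problem: produce rate went up shortly before the onset.
    producer = relative_change(p_before, p_after) > threshold
    # Consumer-side problem: consume rate went down shortly before the onset.
    consumer = relative_change(c_before, c_after) < -threshold
    if producer and consumer:
        return "both"
    if producer:
        return "producer"
    if consumer:
        return "consumer"
    return "unclear"


if __name__ == "__main__":
    # One sample per minute; produce rate jumps from 100 to 150 msg/s at t=300.
    produce = [(float(t), 100.0 if t < 300 else 150.0) for t in range(0, 600, 60)]
    consume = [(float(t), 100.0) for t in range(0, 600, 60)]
    print(localize(produce, consume, onset=600.0))
```

A result of `producer` points you at upstream systems, while `consumer` points you at processing time and worker count. `unclear` means the high-level metrics did not settle the question and you need a different window or more data.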
The diagram below summarizes this process.
The core idea of this root cause analysis process is the second step, where we bisect the problem in order to save time. The approach starts with high-level metrics that provide evidence of where the problem is located and then narrows down to specific metrics. The alternative, looking at every specific metric to find the root cause, usually takes longer.
Queue overload causes incidents in production. Effective management of incidents requires being able to quickly make decisions to restore service. By understanding the core flows that affect a system you can make decisions faster.