Queues are popular building blocks in web applications. They are used to send messages between different systems (e.g. using Kafka or RabbitMQ). They are used to schedule work in background job processing frameworks (e.g. Sidekiq, Resque). They are used by web app servers (e.g. Unicorn) to briefly stage requests during overload.
They are the backbone of many applications and need to be monitored like any other component. Effective collection of queue metrics can help you identify an incident early so you can repair the system more quickly. It can also help you pinpoint the root cause of an incident faster.
These are the metrics that I find valuable to track for queues:
- Queue latency – how long a message spends waiting on the queue before it is processed. This should be measured using percentiles (e.g. p95) and kept as low as possible. It is one of the clearest indicators of trouble in a queueing system; the first sketch after this list shows one way to measure it.
- Queue size – the number of messages on the queue. Queue size indicates whether the queue's limits are being approached, since queues are always bounded in some way (e.g. by disk space or memory). When the queue is overloaded, its size tends to grow.
- Produce rate – the rate at which messages are placed on the queue. Most incidents in queue systems are caused by a prolonged mismatch between the produce and consume rates; the second sketch after this list shows one way to track both.
- Consume rate – the rate at which messages are removed from the queue. Watch this alongside the produce rate: it is the sustained gap between the two that gets you into trouble.
- Message processing latency – how long it takes a consumer to process a message. This should also be measured using percentiles (e.g. p95). Processing latency directly affects the consume rate: the slower each message is handled, the fewer messages each consumer can work through per second.
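
Here is a minimal sketch of how queue latency and processing latency can be measured on the consumer side, assuming each message carries an `enqueued_at` timestamp set by the producer. The `record_histogram` helper is a hypothetical stand-in for whatever metrics client you use (StatsD, Prometheus, etc.), and the in-memory list is only there to make the example self-contained:

```python
import time


def record_histogram(name: str, value_ms: float) -> None:
    # Hypothetical stand-in for your metrics client. It should feed a
    # histogram so you can read off p95 later.
    print(f"{name}: {value_ms:.1f} ms")


def produce(queue: list, payload) -> None:
    # The producer attaches the enqueue timestamp to the message itself,
    # so any consumer can compute queue latency without extra lookups.
    queue.append({"payload": payload, "enqueued_at": time.time()})


def consume(queue: list, handler) -> None:
    message = queue.pop(0)

    # Queue latency: time the message spent waiting before a consumer picked it up.
    queue_latency_ms = (time.time() - message["enqueued_at"]) * 1000
    record_histogram("queue.latency", queue_latency_ms)

    # Processing latency: time the handler needs for this one message.
    started_at = time.time()
    handler(message["payload"])
    processing_ms = (time.time() - started_at) * 1000
    record_histogram("queue.processing_latency", processing_ms)
```

With both values flowing into histograms, the p95 of queue latency is the number to alert on.

The remaining three metrics can be derived from two monotonically increasing counters plus the queue length, sampled at a fixed interval. The `InstrumentedQueue` class and `report_rates` function below are illustrative names, not part of any library; in practice you would emit the raw counters and let your metrics backend compute the rates:

```python
import threading
import time
from collections import deque


class InstrumentedQueue:
    """A toy in-process queue that tracks the counters needed for rate and size metrics."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self.produced_total = 0  # monotonically increasing counters
        self.consumed_total = 0

    def produce(self, message):
        with self._lock:
            self._items.append(message)
            self.produced_total += 1

    def consume(self):
        with self._lock:
            message = self._items.popleft()
            self.consumed_total += 1
        return message

    def size(self) -> int:
        return len(self._items)


def report_rates(queue: InstrumentedQueue, interval_s: float = 10.0) -> None:
    """Sample the counters at a fixed interval and derive per-second rates."""
    last_produced, last_consumed = queue.produced_total, queue.consumed_total
    while True:
        time.sleep(interval_s)
        produce_rate = (queue.produced_total - last_produced) / interval_s
        consume_rate = (queue.consumed_total - last_consumed) / interval_s
        last_produced, last_consumed = queue.produced_total, queue.consumed_total
        # Queue size is a gauge: simply the current length at sampling time.
        print(f"produce={produce_rate:.1f}/s consume={consume_rate:.1f}/s size={queue.size()}")


# threading.Thread(target=report_rates, args=(my_queue,), daemon=True).start()
```

A sustained gap between the two rates is the early warning that queue latency and queue size are about to climb.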