Understanding the Usage and Importance of Rate Functions

rate and irate are among the most commonly used functions in PromQL, the Prometheus query language. Both calculate how fast a counter metric increases per second, and they work with any counter regardless of which exporter or application produced it. Site reliability engineers, monitoring teams, and developers use rate and irate to analyze how metrics behave over time, which helps them monitor system performance and troubleshoot issues.

The rate function calculates the per-second average rate of increase over a specified time window. Averaging over the whole window smooths out short-term fluctuations and gives a more stable view of trends. This is particularly useful for alerting and for dashboards, where stable, reliable values are needed.
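To make the averaging concrete, here is a minimal Python sketch of the idea behind rate. It is a simplification, not Prometheus's implementation: it ignores the extrapolation to the window boundaries that Prometheus performs, and the function name simple_rate and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across one window.

    Simplified sketch: no extrapolation to the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr < prev:
            # A drop means the counter was reset and restarted from zero,
            # so the whole current value counts as new increase.
            increase += curr
        else:
            increase += curr - prev

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrape data: one sample every 15 s, with a burst in the middle.
window = [(0, 100), (15, 130), (30, 160), (45, 460), (60, 490)]
print(simple_rate(window))  # 6.5
```

The burst of 300 requests between the third and fourth samples is spread across the full 60 seconds, which is why the result stays moderate instead of jumping.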

On the other hand, irate calculates the rate from only the last two samples in the specified window. This gives a more immediate view of changes, which makes it useful for spotting sudden spikes or drops in fast-moving metrics.
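The following Python sketch illustrates how an instant rate based on the last two samples reacts to a recent burst, compared with the average over the whole window. It is a simplified model rather than Prometheus's code; the name simple_irate and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples of a window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")

    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A drop is treated as a counter reset: the counter restarted from zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical scrape data: steady traffic, then a burst in the last interval.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 490)]

instant = simple_irate(window)
average = (window[-1][1] - window[0][1]) / (window[-1][0] - window[0][0])
print(instant)  # 20.0
print(average)  # 6.5
```

The instant value reflects only the final 15 seconds, so it shows the spike at full height, while the window average dilutes it.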

Google's monitoring strategy, for example, focuses on four key metrics known as the Four Golden Signals: latency, traffic, errors, and saturation. The rate function is used extensively when computing the first three. Here are some simple examples.

Traffic

rate is directly used to measure the volume of requests over time, such as HTTP requests or database queries.

For example, the following query calculates the rate of HTTP requests per second, grouped by both instance (the target, e.g., a server or application) and method (e.g., GET, POST).

sum by (instance, method) (rate(http_requests_total[5m]))

Errors

rate helps to monitor the rate of errors over time, such as HTTP 500 responses or failed jobs. This enables tracking the health of services and identifying issues quickly.

For example, this query shows the total rate of HTTP 500 errors for each instance. It can be used to identify which instances are experiencing the most server errors and may need attention.

sum by (instance) (rate(http_requests_total{status="500"}[5m]))

Latency

Latency typically involves histogram buckets. The rate function can be used to calculate the per-second rate of observations within each histogram bucket over a time window.

For example, the following query calculates the 95th percentile latency (response time) for requests over the last 5 minutes using histogram data:

histogram_quantile(
    0.95,
    sum by (le) (rate(request_duration_seconds_bucket[5m]))
)
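To show why the per-bucket rates are needed, here is a rough Python sketch of how a quantile can be estimated from cumulative bucket rates by linear interpolation inside the matching bucket. It is an approximation of the idea behind histogram_quantile for classic histograms, not the actual Prometheus code; the function name and the bucket values are hypothetical.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (le upper bound, cumulative per-second rate)


def simple_histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("expected at least two buckets, the last with le=+Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative rate reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break

    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at zero.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical output of sum by (le) (rate(..._bucket[5m])).
rates = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(simple_histogram_quantile(0.95, rates))  # about 0.8125 seconds
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies of interest.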

Observability with Prometheus and Grafana

A Complete Hands-On Guide to Operational Clarity in Cloud-Native Systems
