Observability with Prometheus and Grafana

A Complete Hands-On Guide to Operational Clarity in Cloud-Native Systems

Sharding and Federation: Scaling Strategies for Prometheus

Sharding

Sharding is a horizontal scaling technique for Prometheus that involves splitting the metrics scraping load across multiple Prometheus instances. Each instance, or shard, is responsible for scraping a subset of the total metrics, effectively distributing the load and enabling Prometheus to handle significantly larger environments.

How Sharding Works

Sharding in Prometheus involves dividing scrape targets, which are the endpoints providing metrics, among multiple Prometheus instances. Each shard is configured to scrape only the targets assigned to it. This approach prevents overloading any single Prometheus instance while enabling the system to handle a larger number of targets effectively.

Each Prometheus instance operates independently, maintaining its own local storage and state. This independence means that the shards are unaware of one another, and there is no built-in mechanism for inter-shard communication or coordination. This design simplifies the operation of individual shards but necessitates external tools for certain cross-shard functionalities.

Advantages and Limitations of Sharding

Performance, scalability, and fault isolation are the primary benefits of sharding in Prometheus:

  • Improved Performance: Each Prometheus instance scrapes a smaller number of targets, so the load is shared and overall system performance improves.

  • Scalability: Sharding enables Prometheus to scale horizontally by adding more instances to accommodate growing workloads. Once the process is mastered, adding new shards is relatively straightforward.

  • Fault Isolation: Issues in one shard do not affect others. This isolation prevents a single shard from causing a system-wide failure.

The main limitation of sharding in Prometheus is the lack of built-in support for querying across shards. Imagine you have two Prometheus instances, each scraping a different set of targets: to query data from both instances in one place, you need an external tool like Thanos to aggregate the results.

To perform global queries or aggregate data across shards, an external query layer is typically required. Tools such as Thanos or Cortex are commonly used for this purpose. These tools provide the ability to query data across multiple Prometheus instances.

To add a Thanos central query component on top of sharded Prometheus instances, you need to configure each Prometheus instance to run a Thanos Sidecar. The Sidecar component exposes the Prometheus data to the Thanos Querier, which aggregates data from all shards. More details on setting up Thanos can be found in the Thanos documentation.
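As a rough sketch of what this looks like in practice, each shard runs a Sidecar next to its local Prometheus, and a single Querier fans out to all of them. The hostnames, ports, and paths below are placeholders for illustration, not values from this book's setup; see the Thanos documentation for the full set of flags.

# On each shard, run a Thanos Sidecar next to the local Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/lib/prometheus \
  --grpc-address=0.0.0.0:10901

# On a central host, run the Thanos Querier and point it at every sidecar
thanos query \
  --http-address=0.0.0.0:10904 \
  --store=shard-0.example.internal:10901 \
  --store=shard-1.example.internal:10901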

Configuring Sharding

Sharding can be implemented manually or through automation tools. A common approach is to use external systems to distribute scrape targets: tools like the Prometheus Operator or custom scripts can automate the assignment of targets to shards (see the sketch after this paragraph). A hash function can also assign scrape targets to specific shards based on their names or labels; for instance, the hashmod action in Prometheus relabeling rules distributes targets based on a hash of the selected labels. This is the most common and straightforward way to shard Prometheus, and it is the approach used in the example below.
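For Kubernetes users, the Prometheus Operator mentioned above can handle this distribution declaratively. The snippet below is only a minimal sketch, not a complete manifest: the resource name and namespace are placeholders, and the Operator generates the per-shard hashmod relabeling for you.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sharded-prometheus   # placeholder name
  namespace: monitoring      # placeholder namespace
spec:
  shards: 2                  # one set of Prometheus pods per shard
  replicas: 1                # replicas per shard
  serviceMonitorSelector: {} # select which ServiceMonitors to scrape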

In our example, we have a pool of 4 servers sending a large number of metrics to a single Prometheus instance. Adding a second Prometheus instance and configuring sharding helps distribute the scraping workload across the two Prometheus instances. There is nothing special to do on the service exposing the metrics; the sharding is done entirely on the Prometheus side. Here are the configurations:

First Prometheus instance:

global:
  external_labels:
    shard: '0' # quoted so the label value is a string

scrape_configs:
  - job_name: 'my_heavy_loaded_service'
    static_configs:
      - targets:
          - '10.135.0.5:5000'
          - '10.135.0.6:5000'
          - '10.135.0.7:5000'
          - '10.135.0.8:5000'
    relabel_configs:
      # Hashmod based on the target address
      - action: hashmod
        source_labels: [__address__]
        modulus: 2 # Number of shards
        target_label: __tmp_hash
      # Assign targets to shard 0
      - action: keep
        source_labels: [__tmp_hash]
        regex: '0'

Second Prometheus instance:

global:
  external_labels:
    shard: '1' # quoted so the label value is a string

scrape_configs:
  - job_name: 'my_heavy_loaded_service'
    static_configs:
      - targets:
          - '10.135.0.5:5000'
          - '10.135.0.6:5000'
          - '10.135.0.7:5000'
          - '10.135.0.8:5000'
    relabel_configs:
      # Hashmod based on the target address
      - action: hashmod
        source_labels: [__address__]
        modulus: 2 # Number of shards
        target_label: __tmp_hash
      # Assign targets to shard 1
      - action: keep
        source_labels: [__tmp_hash]
        regex: '1'
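Each instance now scrapes only the targets whose hashed address maps to its shard. As a quick sanity check, you can count the scraped targets on each shard, or aggregate per shard through the Thanos Querier using the shard external label; keep in mind that with only four targets, hashmod does not guarantee a perfectly even two-and-two split. The queries below are a sketch of such a check:

# On a single shard: how many of the four targets this instance scrapes
count(up{job="my_heavy_loaded_service"})

# Through the Thanos Querier: targets per shard, grouped by the external label
count by (shard) (up{job="my_heavy_loaded_service"})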
