Observability with Prometheus and Grafana

A Complete Hands-On Guide to Operational Clarity in Cloud-Native Systems

Strategies to Scale Prometheus: Remote Write and Agent Mode
Remote Write

Remote write lets a Prometheus server stream samples it scrapes to external backends for long‑term storage, aggregation, or global querying.

Prometheus batches samples and sends them via HTTP POST to the receiver’s /api/v1/write endpoint using snappy‑compressed Protocol Buffers (per the remote write spec). The pipeline is near real time (queue‑based, with retries and backoff) rather than truly synchronous.

You configure this in prometheus.yml under remote_write; the key parameters include:

  • Endpoints and optional auth (basic, OAuth, headers, TLS).
  • Queue/batching: shards, capacity, max samples per send, batch deadlines, backoff.
  • Relabeling: write_relabel_configs to keep/drop/transform series before they’re sent.
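
A minimal prometheus.yml sketch tying these parameters together. The endpoint URL, credentials path, and the dropped metric pattern are illustrative placeholders, not values from this guide:

```yaml
remote_write:
  - url: "https://metrics.example.com/api/v1/write"   # hypothetical receiver endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_write_password
    queue_config:
      capacity: 10000            # samples buffered per shard
      max_shards: 50             # upper bound on parallel senders
      max_samples_per_send: 2000
      batch_send_deadline: 5s    # flush partial batches after this long
      min_backoff: 30ms          # retry backoff on receiver errors
      max_backoff: 5s
    write_relabel_configs:
      # Drop noisy series before they leave Prometheus
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
```

The defaults are reasonable for most setups; queue_config tuning usually only matters once the receiver starts lagging or you see dropped samples in Prometheus's own remote write metrics.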

Samples keep their original labels and timestamps. Prometheus may add external_labels (if configured) and receivers may relabel on ingest.

Some common backends (receivers) that support Prometheus remote write include:

  • Thanos Receive (part of Thanos)
  • Cortex / Grafana Mimir
  • VictoriaMetrics
  • Others that implement the spec, e.g., Timescale/Promscale, InfluxDB's remote write endpoint, etc.

Use Cases

The main use cases for Remote Write include offloading storage and enabling global querying. Specifically, it lets you:

  • Push old or high-volume data out of Prometheus so your local storage doesn’t explode. Remember, Prometheus is that hot news reporter who specializes in breaking news, not in archiving old stories.
  • Query everything from one place when the remote system has its own global querier, as with tools like Thanos, Cortex, or Mimir.
  • Keep scraping locally the same way; remote write only changes where the data gets stored and queried.

Advantages and Limitations

Remote Write offers several benefits for scaling Prometheus, mostly around how data is stored, transported, and handled after the scrape. The key advantages are:

  • Long-Term Data Retention: Offloads stored samples to a backend that can keep them for months or years without stressing Prometheus's local TSDB. Some teams need to retain metrics for compliance, others for historical analysis and comparison, and others simply have long-term trends to monitor. Whatever the reason, remote write helps.

  • Centralized Metrics Management: Allows multiple Prometheus servers to stream their data into one backend. This approach simplifies the operational complexity by providing a unified view of metrics across different environments or clusters. You may, for example, have different cloud regions with their own Prometheus servers, all sending data to a single remote storage system.

  • Real-Time Data Streaming: Metrics are forwarded as they are scraped, so remote backends receive fresh data with minimal delay. "Real-time" here means seconds to a few minutes, depending on network conditions and backend performance. The delay is usually negligible for many monitoring use cases.

  • Elastic Storage and Query Capacity: External systems scale independently from Prometheus, avoiding TSDB size limits and improving query performance at large scale. Imagine a Prometheus server scraping thousands of targets and generating millions of time series. Storing and querying all that data locally can be challenging. By using remote write to send data to a scalable backend like Thanos, you can leverage its distributed architecture to handle large volumes of data efficiently.

  • Easier Horizontal Growth: As you add more Prometheus servers, the workload on each remains small while all data flows to the same backend. Horizontal scaling becomes straightforward since each Prometheus instance only needs to manage its own scrape load and push data out.
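
A common way to make the centralized, multi-server setup work is to give each Prometheus server distinguishing external_labels, so series from different regions or clusters stay separable in the shared backend. A sketch, where the label values and receiver URL are illustrative:

```yaml
# prometheus.yml on the us-east-1 server
global:
  external_labels:
    region: us-east-1       # distinguishes this server's series in the backend
    replica: prometheus-a   # lets the backend deduplicate HA pairs

remote_write:
  - url: "https://mimir.example.com/api/v1/push"   # hypothetical Mimir endpoint
```

Each server carries a different region (and replica) value; the backend then sees one merged stream it can slice by those labels.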

With these benefits come some limitations, mainly related to the nature of remote write as a transport mechanism. Some important limitations include:

  • Dependency on Network Reliability: If the remote endpoint is slow or unreachable, queues can back up or samples may be dropped. This is relatively rare and directly depends on your network and backend reliability.

  • Transmission Delay Under Load: Heavy traffic or distant backends introduce slight lag before data becomes available remotely. If you plan to use remote write backends for alerting or dashboards, consider the potential delay.

  • Bandwidth and Throughput Costs: High-cardinality or high-frequency workloads can produce large sustained data streams. If you're paying for what you're using (e.g., cloud egress costs or inter-region data transfer), this can add up very quickly. You don't know what you don't know: start by measuring your current usage, use the cloud cost calculators, and estimate accordingly.

  • No Native Querying: Remote write is a one-way pipeline. Prometheus pushes data out to another system, but it cannot read that data back; it does not query the remote storage itself. If you want to query the data you pushed away, you need either:

  • remote_read, which lets Prometheus pull data back from the remote store, or

  • A separate query layer like Thanos, Cortex, or Mimir, which sits on top of all the stored data and answers your queries.

Think of it like mailing copies of your notes to a library. Remote write sends the notes. Prometheus can’t read from the library unless you add remote_read or use another tool built for searching the library.
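
If you do want Prometheus itself to read from the library, remote_read is configured alongside remote_write in prometheus.yml. A sketch, with a placeholder URL:

```yaml
remote_read:
  - url: "https://metrics.example.com/api/v1/read"   # hypothetical remote read endpoint
    read_recent: false   # default: skip remote reads for ranges the local TSDB still covers
```

With read_recent left at false, recent queries stay fast and local, and Prometheus only reaches out to the remote store for older time ranges.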

Remote Write is ultimately a transport mechanism: it doesn't change how Prometheus scrapes, evaluates rules, or performs local alerts. It simply moves data out to a system built for long-term or aggregated storage.
