
Failure is inevitable: Learning from a large outage, and building for reliability in depth at

@devopslinks ・ Dec 13, 2025

Datadog ditched its “never fail” mindset after a March 2023 meltdown knocked out half its Kubernetes nodes and took major user features down with them. The fix? A full-stack rethink built around graceful degradation.

The team added disk-based persistence at intake, live-data prioritization, QoS-aware retry logic, and localized failover for control plane calls. In other words: no more all-or-nothing. If it breaks, it bends instead.
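To make two of these ideas concrete, here is a minimal sketch of disk-based persistence at intake combined with live-data prioritization. It is not Datadog's implementation: the `Intake` class, the `send` callable, and the JSON-lines spool file are all hypothetical names chosen for illustration, and the sketch assumes a downstream failure surfaces as a `ConnectionError`.

```python
import json
import os


class Intake:
    """Hypothetical intake stage: spool to disk on failure, serve live data first."""

    def __init__(self, send, spool_path):
        self.send = send              # assumed: callable(payload) raising ConnectionError on failure
        self.spool_path = spool_path  # append-only JSON-lines file used as the on-disk buffer
        self.live = []                # fresh payloads waiting for delivery

    def accept(self, payload):
        """Queue a freshly received payload."""
        self.live.append(payload)

    def flush(self):
        """Deliver live payloads; spool to disk whatever cannot be delivered."""
        delivered = 0
        pending, self.live = self.live, []
        for payload in pending:
            try:
                self.send(payload)
                delivered += 1
            except ConnectionError:
                self._spool([payload], mode="a")
        return delivered

    def replay(self):
        """Replay the spooled backlog, but only once no live data is waiting."""
        if self.live or not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as f:
            backlog = [json.loads(line) for line in f if line.strip()]
        remaining, replayed = [], 0
        for payload in backlog:
            try:
                self.send(payload)
                replayed += 1
            except ConnectionError:
                remaining.append(payload)
        # Rewrite the spool so it holds only what is still undelivered.
        self._spool(remaining, mode="w")
        return replayed

    def _spool(self, payloads, mode):
        with open(self.spool_path, mode, encoding="utf-8") as f:
            for payload in payloads:
                f.write(json.dumps(payload) + "\n")
```

The design choice here is that an outage downstream costs freshness of old data rather than data itself: payloads that cannot be sent survive on disk, and when the downstream recovers, new traffic is delivered before the backlog is replayed.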

