Finding zombies in our systems: A real-world story of CPU bottlenecks

After a network outage crisis, Pinterest's ML Platform team discovered that high Kubernetes agent CPU usage was causing critical Ray training jobs to fail. Deep profiling revealed a rarely seen flaw in how Kubelet handled memory cgroup iterations.

