After a network outage crisis, Pinterest's ML Platform team discovered high Kubernetes agent CPU usage was causing critical Ray training job failures. The team's deep profiling strategy revealed a rarely seen flaw in how Kubelet was handling memory cgroup iterations.










