ClickHouse ran five LLMs through an autonomous root-cause analysis gauntlet using OpenTelemetry data. None nailed it solo. OpenAI’s o3 and Claude Sonnet 4 came closest. GPT-4.1 was the cheapest brain on the block.
Things got messy under the hood. Token usage spiked unpredictably from run to run. The models' queries hit observability backends harder than expected.
The big picture: LLMs aren't your on-call engineer. They shine as copilots that speed up queries, not as replacements for humans.