Cotool threw frontier LLMs at real-world SecOps tasks using Splunk’s BOTSv3 dataset. GPT-5 topped the chart on accuracy (62.7%) and delivered the best results per dollar. Claude Haiku-4.5 finished tasks fastest, averaging just 240 seconds, while making full use of tool integrations. Gemini-2.5-pro flopped on both accuracy and reliability, with repeated failures.










