Forcing LLMs to be evil during training can make them nicer in the long run
Researchers built an automated pipeline to hunt down the neuron patterns behind bad LLM behavior: sycophancy, hallucinations, malice, the usual suspects. Then they trained models to watch for those patterns in real time. Anthropic didn’t just steer models after training, as most approaches do. They baked the correct..