MIT Technology Review

Forcing LLMs to be evil during training can make them nicer in the long run

A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits. Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly…
bsky.app
AI and ML News on Bluesky @ai-news.at.thenote.app
technologyreview.com