Researchers have developed a new jailbreak technique called "Bad Likert Judge" that exploits LLMs' ability to assess harmful content. The method is a multi-step process built around Likert-scale scoring of prompts: the LLM is first asked to score the harmfulness of provided content, then prompted to produce examples that would score at the low and high ends of the scale. The high-scoring example often contains harmful content, and additional follow-up steps can amplify the harmful output further. The technique was tested against six leading LLMs across 1,440 instances, achieving an average success rate of 71.6%, significantly higher than direct attacks. The results highlight a vulnerability in current LLMs and underscore the need for improved safety measures in LLM development. The findings were reported by SC Media, based on Palo Alto Networks Unit 42 research.
it.slashdot.org