Microsoft details threat from new AI jailbreaking method
Skeleton Key bypasses AI behavioural guidelines, revealing harmful information. Microsoft has implemented safeguards in its AI models to prevent security breaches and address this broader generative AI exploitation issue.
Microsoft has warned about a new jailbreaking technique called Skeleton Key, which can prompt AI models to disclose harmful information by bypassing their behavioural guidelines. Detailed in a report published on 26 June, Microsoft explained that Skeleton Key forces AI models to respond to illicit requests by modifying their behavioural guidelines to provide a warning rather than refusing the request outright. A technique like this, called Explicit: forced instruction-following, can lead models to produce harmful content.
The report highlighted an example where a model was manipulated to provide instructions for making a Molotov cocktail under the guise of an educational context. The prompt allowed the model to deliver the information with only a prefixed warning by instructing the model to update its behaviour. Microsoft tested the Skeleton Key technique between April and May 2024 on various AI models, including Meta LLama3-70b, Google Gemini Pro, and GPT 3.5 and 4.0, finding it effective but noting that attackers need legitimate access to the models.
Microsoft has addressed the issue in its Azure AI-managed models using prompt shields and has shared its findings with other AI providers. The company has also updated its AI offerings, including its Copilot AI assistants, to prevent guardrail bypassing. Furthermore, the latest disclosure underscores the growing problem of generative AI models being exploited for malicious purposes, following similar warnings from other researchers about vulnerabilities in AI models.
Why does it matter?
In April 2024, Anthropic researchers discovered a technique that could force AI models to provide instructions for constructing explosives. Earlier this year, researchers at Brown University found that translating malicious queries into low-resource languages could induce prohibited behaviour in OpenAI’s GPT-4. These findings highlight the ongoing challenges in ensuring the safe and responsible use of advanced AI models.