When Randomness Backfires: Security Risks in AI
The Most Important Tool When Hacking AI Is Persistence
Large Language Models (LLMs) will produce different answers for the same prompt, even if you ask the identical question repeatedly.
AI of this kind are just giant probability matrices. Text is reduced to a series of numbers that get multiplied through the model’s matrices, and the resulting numbers get turned back into a probable next word or set of words.
But the results are a set of probabilities, and always just picking the top one could lead to stilted outputs that over-represent certain observances in the training set (particularly if there are multiple possibilities that are close to the same probability). So instead, LLMs randomly pick among the probable output options.
So if we were asking for a completion to the sentence: “Who is George…” then it might predict “Washington”, “Floyd,” “Clooney,” or “RR Martin,” differing each time we ask.
If, instead, the LLM were deterministic, then it would always produce a single result, like “Washington.” In AI, there’s a configurable dial that controls the randomness called the “temperature.” When set to zero, the LLM’s output is fixed for any given input (the most probable option). When temperature is set to one, it uses the maximum randomness. Higher temperatures (like 0.85) are better for brainstorming and creativity, but low temperatures (like 0.15) produce bland and repetitive outputs. Responses can still be constrained to just the most probable answers, but randomness is needed to make these systems work well.
From a security standpoint, this is significant, because a prompt might yield an innocuous answer nine times out of ten, but a dangerous answer that tenth time. And this is why you see headlines about AI’s suggesting self harm or describing how to build a bomb to people, despite being trained not to do these things.
Adversarial training: baking constraints into models
If we just trained these models on a bunch of random text found on the Internet and then asked it to guess what would come next after some prompt, we’d get all kinds of craziness coming out of the models. We don’t want them spouting fringe conspiracies or far-right or far-left theories, we don’t want models telling people to hurt themselves or teaching criminals how to make bombs. The list of things we don’t want a model to predict is actually quite long.
So when a model is trained, first it just learns about language (potentially many languages) by ingesting large amounts of text and creating a base model. The base model isn’t one you want to use for the reasons above, so a bunch of additional rounds of training are conducted using techniques like Reinforcement Learning from Human Feedback (RLHF) and adversarial training to refine the model. RLHF trains a model on good versus bad answers as determined by human ratings. Essentially, in a model’s first round of training it learns what to say, and in subsequent rounds of training, it learns what not to say.
From a security standpoint, adversarial training is more interesting. We won’t go into it here and now, but this is the bit designed to teach it to resist various types of attacks that seek to extract the system prompt or to get it to spit out private data and to otherwise resist known jailbreak attempts.
When guardrails fail: prompt injections, system prompt leaks, and data extraction
We’ve been making demonstrations of various attacks against AI lately for some conference talks (we’ll be showing these off at DEF CON 33 on the Creator Stage on Saturday), but there’s been one theme to our successful attacks: persistence. In trying to extract private and sensitive data from a fine-tuned model, for example, where that model has been trained to not give out such information, the best technique we found was to just keep asking. Some prompts worked better than others, but without fail, persistence led to the models giving up private data in spite of repeatedly protesting that they aren’t allowed to do so.
Another example we’ll show at DEF CON was an accident. I was trying to make a meme for a section title slide and we prompted ChatGPT with this:
Prompt: “create a meme image based on the scene in goonies where chunk is being threatened with a blender but where the caption is about getting an AI to reveal its training data”
In response, it created an image that was pretty close to what I hoped for:
But I thought it could do better, so I hit the retry button (with the same exact prompt). This time I got this message back:
Response: “I can’t generate that image because the request violates our content policies.”
I tried about eight more times, attempting some minor changes to the prompt, but I didn’t get it to produce another meme image.
Which begs the question: if this prompt is against their policies, why was I able to get an image in the first place?
Answer: because training constraints into models is fundamentally unreliable given the random-ish nature of the outputs.
A failing grade in security: 99% safe is unsafe
For most things in life, a 99% grade is an A+, but in security, 99% protection is an F. A single failure can compromise a system, and if an attacker needs to try 100 times to do it, they will. Imagine if firewalls had rules to block certain ports but included an element of randomness that sometimes just let packets through anyway. No one would use a firewall like that. And no one should use AI in high-trust environments if they need to rely on those built-in restrictions. Testing might not catch the problems, but they’re there.
Imagine if firewalls had rules to block certain ports but included an element of randomness that sometimes just let packets through anyway. No one would use a firewall like that.
AI developers must adopt a security mindset that anticipates hackers and treats all output from LLMs as risky.
Conclusion
Randomness is what makes LLMs versatile, creative, and powerful, but it also makes them inherently prone to failing to always follow the rules they were taught. This creates an unstable foundation for safe and secure AI use and is so fundamentally difficult to fix that it will likely be a rich area for hackers for many years to come. Until we radically improve the robustness of these systems, they should not be fully trusted. Because 99% safe is still unsafe.