Attacking AI systems is a relatively new field of study. Most of the papers we’ve found on the topic (especially around embeddings) are from the past few years, and we haven’t found any that were published before 2017. The field is often called adversarial AI, and it covers many techniques, from prompt injection to several that are most relevant to embeddings:
- Embedding inversion attack – decodes embeddings back into their source data (approximately)
- Attribute inference attack – extracts information about the source data not expressly part of the source material, such as inferring sentence authorship based on embeddings
- Membership inference attack – can be applied directly to models, but in this case uses embeddings to determine whether particular data was part of the embedding model’s training set, without knowing anything about that model
For now, we’ll focus on embedding inversion attacks.
One important paper in this space from 2020 is “Information Leakage in Embedding Models,” which makes this claim:
Embedding vectors can be inverted to partially recover some of the input data. As an example, we show that our attacks on popular sentence embeddings recover between 50%–70% of the input words (F1 scores of 0.5–0.7).
Although the embeddings capture the semantic meaning of the sentences rather than their exact wording, the authors can still recover specific input words. They go on to say:
For embeddings susceptible to these attacks, it is imperative that we consider the embedding outputs containing inherently as much information with respect to risks of leakage as the underlying sensitive data itself. The fact that embeddings appear to be abstract real-numbered vectors should not be misconstrued as being safe.
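To give a sense of what an F1 score in the quoted 0.5–0.7 range means for word recovery, here is a minimal sketch. It assumes a simple set-based comparison of lowercased words; the paper’s exact scoring procedure may differ, and the function name and both sentences are made up for illustration.

```python
def word_f1(original: str, recovered: str) -> float:
    """Set-based F1 between the words of the original and the recovered text."""
    truth = set(original.lower().split())
    guess = set(recovered.lower().split())
    overlap = len(truth & guess)
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)  # share of recovered words that are correct
    recall = overlap / len(truth)     # share of original words that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 of the 7 original words are recovered, plus 2 wrong guesses.
score = word_f1(
    "the patient was prescribed insulin for diabetes",
    "patient prescribed insulin daily for pain",
)
print(round(score, 3))  # 0.615
```

Even at this level of accuracy, the sensitive words in the sentence (the drug name and the fact that it was prescribed) survive the round trip.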
Another paper published earlier this year takes recovery a step further: “Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence.” The paper is accompanied by open source code that allows anyone to use the approach to reverse sentence embeddings. Essentially, the authors train an adversarial model on associations between embeddings and sentences and then use it to recover the original meaning. They focus on semantic meaning more than on individual words, so they can distinguish between “Alice likes Bob” and “Bob likes Alice” (to use their example) when decoding an embedding.
It’s interesting to note that these ideas could be combined: another model could correct and complete the recovered text, so that an attacker recovers not just individual sentences but entire paragraphs that make sense and fit well together.
And finally, embedding inversions aren’t just a concern for text data. In the blog post “Inverting facial recognition models (Can we teach a neural net to convert face embedding vectors back to images?),” the author shows that he can easily reproduce photos of faces from their embeddings alone, even when he doesn’t know anything about the model used to generate them:
To summarize, with 870 images, we’ve successfully set up a method to decode any embedding produced using a cloud-based face embedding API, despite having no knowledge of the parameters or architecture of the model being used. … We were then able to use all that to derive a way to design a general purpose face embedding decoder, that can decode the outputs from any face embedding network given only its embeddings and associated embeddings; information about the model itself not needed.
“Brute force” attacks
Because many embedding models are shared, such as text-embedding-ada-002 from OpenAI and commonly used SBERT models like all-mpnet-base-v2 (which is built on Microsoft’s MPNet), reversing embeddings can be even easier.
If you know or can guess what embedding model was used, you can build up a huge number of embeddings that tie back to original sentences and then simply compare a retrieved (or stolen) embedding to the corpus to find known sentences similar to the original unknown input.
Consider, for example, an attacker who gets into a public company’s data stores and is able to download their embeddings. The goal of the attacker is to uncover whether the company will miss, meet, or exceed Wall Street’s expectations in an upcoming earnings report so they can trade its stock accordingly. Companies tend to use specific language when they talk about their earnings. Knowing that, an attacker could simply query the embeddings in a vector database with specific phrases, looking for very near or exact matches. The idea is not dissimilar from how passwords stored as one-way hashes are “reversed” via dictionary attacks that hash each word in a dictionary for later comparisons.
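The dictionary-style lookup described above can be sketched in a few lines. This is only an illustration: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector) for whatever shared embedding model the victim actually used, and the candidate phrases are invented. A real attacker would call the real model and search a far larger phrase list.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use hundreds or thousands


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a shared embedding model (hashed bag of words)."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        sign = 1.0 if digest[1] % 2 == 0 else -1.0
        vec[digest[0] % DIM] += sign
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine similarity


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# The attacker precomputes embeddings for phrases they expect to find.
candidate_phrases = [
    "revenue exceeded analyst expectations for the quarter",
    "revenue fell short of analyst expectations for the quarter",
    "results were in line with analyst expectations for the quarter",
]
dictionary = [(phrase, toy_embed(phrase)) for phrase in candidate_phrases]

# A stolen embedding whose source text the attacker does not know.
stolen = toy_embed("revenue fell short of analyst expectations for the quarter")

# Compare the stolen vector against every precomputed one and keep the best match.
best_phrase, best_score = max(
    ((phrase, cosine(stolen, vec)) for phrase, vec in dictionary),
    key=lambda pair: pair[1],
)
print(best_phrase, round(best_score, 3))
```

As with password dictionaries, the expensive step (embedding the candidate phrases) is done once and can be reused against every stolen vector produced by the same model.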
How Cloaked AI helps
The algorithm that Cloaked AI uses to encrypt the vector changes the values of individual elements of the vector, subject to a constraint that bounds the change in distance between two encrypted vectors to be less than a configurable error threshold. This means that top-k nearest neighbor searches might return slightly different results, but they will still be mostly the same, although the individual vectors are all modified. Experiments we have run using the Generative Embedding Inversion Attack against the models show that the vector encryption severely degrades the quality of the information extracted from the encrypted embeddings by an attacker trained against the unencrypted output of the model.
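To illustrate the idea of a bounded change in distance, here is a toy sketch. It is not Cloaked AI’s actual algorithm: it simply adds a random offset of limited length to each vector, and the names `perturb` and `max_shift` are assumptions made for this example. If each vector moves by at most `max_shift`, the triangle inequality guarantees that the distance between any two of them changes by at most `2 * max_shift`.

```python
import math
import random


def perturb(vec: list[float], max_shift: float, rng: random.Random) -> list[float]:
    """Add a random offset whose Euclidean length is at most max_shift."""
    noise = [rng.gauss(0.0, 1.0) for _ in vec]
    norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    scale = max_shift * rng.random() / norm  # offset length lies in [0, max_shift)
    return [v + n * scale for v, n in zip(vec, noise)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
a = [rng.uniform(-1.0, 1.0) for _ in range(8)]
b = [rng.uniform(-1.0, 1.0) for _ in range(8)]

max_shift = 0.05
a_enc = perturb(a, max_shift, rng)
b_enc = perturb(b, max_shift, rng)

# How far the pairwise distance drifted after both vectors were modified.
drift = abs(math.dist(a_enc, b_enc) - math.dist(a, b))
print(drift <= 2 * max_shift)  # True
```

Every element of both vectors has changed, yet the distance between them has moved by no more than the configured bound, which is why nearest neighbor rankings stay mostly the same.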
In addition, Cloaked AI shuffles the elements in the vector as the vector is encrypted. This change is sufficient to basically render an embedding inversion attack useless. It further enhances the security of the vector data in a multi-segment scenario where data from multiple input groups (tenants in a SaaS system, different segments of data, etc.) is stored in the same vector index. Since each vector is encrypted using a segment-specific key, vectors in different segments that might have been clustered are moved to different regions of the vector space by the shuffle operation.
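The effect of a key-dependent shuffle can be sketched as follows. This is an illustration under stated assumptions, not Cloaked AI’s implementation: the permutation is derived from a hypothetical key using Python’s `random` module, which is not cryptographically secure, and the key names and vectors are invented.

```python
import hashlib
import math
import random


def keyed_permutation(key: bytes, dim: int) -> list[int]:
    """Derive a repeatable element order from a key (toy, not cryptographic)."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    order = list(range(dim))
    random.Random(seed).shuffle(order)
    return order


def shuffle_vector(vec: list[float], key: bytes) -> list[float]:
    order = keyed_permutation(key, len(vec))
    return [vec[i] for i in order]


# Two similar vectors, as if they came from closely related source text.
u = [0.9, 0.1, -0.3, 0.4, 0.0, -0.7]
v = [0.8, 0.2, -0.2, 0.5, 0.1, -0.6]

# Same key: both vectors are reordered identically, so their distance is preserved.
same_segment = math.dist(shuffle_vector(u, b"tenant-a"), shuffle_vector(v, b"tenant-a"))

# Different keys: the elements no longer line up, so the vectors usually drift apart.
cross_segment = math.dist(shuffle_vector(u, b"tenant-a"), shuffle_vector(v, b"tenant-b"))

print(round(math.dist(u, v), 3), round(same_segment, 3), round(cross_segment, 3))
```

Within a segment, searches still work because every vector is rearranged the same way. An inversion model trained on the unshuffled output of the embedding model, however, sees the elements in positions it was never trained on.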