Security of AI embeddings explained

Vector embeddings produced by machine learning tasks are a prime target for data theft. They're the memory of AI and just as sensitive as the data they derive from. Encrypting the sensitive data you store in vector databases is critical to protecting your company's sensitive data and reputation.

What are AI embeddings?

Embeddings are AI snapshots representing the meaning of some input like text or an image

In the context of machine learning and AI systems, an embedding is an internal representation of the model’s understanding of its inputs. Embeddings are stored as vectors (arrays of numbers) and sometimes called vector embeddings or ML embeddings.
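
As a concrete example, here is a minimal sketch of generating an embedding in Python. The sentence-transformers library and the all-MiniLM-L6-v2 model are illustrative assumptions, not requirements; any embedding model works the same way:

```python
# Minimal sketch: turn text into a vector embedding.
# Assumes the open source sentence-transformers library is installed
# (pip install sentence-transformers); any embedding model would do.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("The quarterly report exceeded expectations.")

print(embedding.shape)  # (384,) -- a 384-dimensional array of floats
print(embedding[:5])    # the first few numbers in the vector
```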

Watch a short explainer video: “Vector encryption mini explainer”

Diagram: The model is the knowledge base and decision center of AI. Vector embeddings are the memory of AI, encoding what the model needs to remember.


Why do AI systems need vector embeddings?

Embeddings enable semantic search, image search, facial recognition, and much more

Embeddings are often used internally by AI systems, but sometimes they are explicitly the output of an AI system, used to make the system more intelligent and capable while reducing side effects like hallucinations. In these cases, you ask an embedding model to evaluate some input, such as text, an image, or audio, and it returns everything it understands about that data in the form of a vector.

The returned embeddings can then be used in a number of advanced ways, including:

  • Similarity searches (see the sketch after this list)
    • Facial recognition
    • Voice identification
    • Similar image search
    • Semantic search (search on meaning instead of keyword)
  • Recommendation engines (for products, people, groups, content, etc.)
  • Retrieval-Augmented Generation (RAG)
    • Any chatbot or GenAI experience that works with private data the model wasn’t trained on likely uses RAG to include data relevant to a query in the request to the main model.
Learn more about RAG security risks
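
To make the similarity-search idea concrete, here is a minimal sketch using plain NumPy and cosine similarity. The random vectors stand in for real model outputs, and a production system would use a vector database or an index like FAISS rather than a brute-force scan:

```python
# Minimal semantic-search sketch: rank stored embeddings by cosine
# similarity to a query embedding. The random vectors are toy stand-ins
# for real model outputs; production systems use a vector DB or FAISS.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 384))                  # 1,000 stored embeddings
query = stored[42] + rng.normal(scale=0.1, size=384)   # noisy copy of doc 42

def cosine_scores(q: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    q = q / np.linalg.norm(q)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

top_k = np.argsort(cosine_scores(query, stored))[::-1][:5]
print(top_k)  # doc 42 should rank first
```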

What are the privacy risks with vector embeddings?

Vector embeddings are a gold mine of private information

Diagram: Data (text, image, audio, etc.) passes through an embedding model to produce a vector embedding that is equivalent to the original data.

Embeddings are a machine representation of arbitrary data. The better the model, the higher the fidelity of the embedding. Much as humans process audible and visual signals and reduce them to an understanding of what’s important, an AI model takes similar inputs and reduces them to meaningful memories stored as vector embeddings.

Embedding inversion attacks

Just as you can extract training data back out of models using model inversion attacks, numerous academic papers demonstrate how you can do the same thing using embedding inversion attacks on vector embeddings. These attacks turn an embedding (a vector of numbers) back into the original input or a close approximation of it.

In the paper with the best results so far, attackers were able to recover the exact inputs in 92% of cases, including full names and health diagnoses. In the remaining 8% of cases, the recovered data was semantically equivalent to the original input, differing mainly in synonyms. In another paper, this one accompanied by open source software that lets anyone reproduce the attack, the inversions largely succeeded at recovering every theme of the original input. For example, where the original text was, “I love playing the cello! It relaxes me!”, the recovered text read, “I love playing the cello! It helps me relax!” It isn’t a perfect reproduction of the input, but it’s close enough for most uses.
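
The published attacks train decoder models that generate text directly from an embedding. As a rough intuition for why that works, even a naive nearest-neighbor lookup over a corpus of guessed sentences recovers any input the attacker can enumerate. A toy sketch (the candidate sentences and the sentence-transformers model are illustrative assumptions):

```python
# Toy intuition for embedding inversion: an attacker with the embedding
# model and a corpus of candidate texts can recover guessable inputs with
# a nearest-neighbor lookup. Real attacks train decoders that generate
# text directly from the embedding, so no candidate list is needed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

candidates = [
    "I love playing the cello! It relaxes me!",
    "The patient was diagnosed with type 2 diabetes.",
    "Quarterly revenue exceeded expectations.",
]
candidate_vecs = model.encode(candidates, normalize_embeddings=True)

# An embedding leaked from a vector database (recomputed here for the demo):
leaked = model.encode("I love playing the cello! It relaxes me!",
                      normalize_embeddings=True)

best = int(np.argmax(candidate_vecs @ leaked))  # highest cosine similarity
print(candidates[best])  # the attacker recovers the original sentence
```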

Membership inference attacks

Also of concern is an attacker’s ability to find out whether some input was used in a vector database. If each vector represents a sentence, an attacker might search for “In 23Q3, the company exceeded expectations,” a common phrasing when a company does well and reports it. With access to a vector database, they could test for that and other expected sentences to see what exists in the system. Or they could test whether specific names, locations, faces, or other data exist in the data set – no inversion required.
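
A minimal sketch of that membership test, again with NumPy; the stored matrix stands in for a dumped vector database, and the 0.95 similarity threshold is an illustrative assumption:

```python
# Minimal membership-inference sketch: embed a guessed input and check
# whether any stored vector matches it almost exactly. The stored matrix
# is a stand-in for a dumped vector DB; the 0.95 threshold is illustrative.
import numpy as np

rng = np.random.default_rng(1)
stored = rng.normal(size=(500, 384))   # plaintext vectors from the database
guess = stored[7].copy()               # the attacker's guess, embedded

def is_member(guess_vec: np.ndarray, stored: np.ndarray,
              threshold: float = 0.95) -> bool:
    g = guess_vec / np.linalg.norm(guess_vec)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    return bool(np.max(s @ g) >= threshold)

print(is_member(guess, stored))                 # True: the guess is in the DB
print(is_member(rng.normal(size=384), stored))  # False for a random probe
```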

In short, embeddings are equivalent to their inputs and are just as sensitive as any data that they derive from.

Security and privacy concerns are the top barriers to adoption of AI, and for good reason. Both benign and malicious actors can threaten the performance, fairness, security and privacy of AI models and data.
─ Gartner

Is this a widely-known problem?

OWASP Top 10 for LLM Applications v2.0 highlights problems with embeddings

Excerpt from OWASP LLM Top 10 Threat Modeling diagram showing vector db storage and RAG flows

OWASP published version 2.0 of their “Top 10 for LLM Applications,” which highlights the biggest threats to application developers embracing AI in their products. This new edition of the top threats list cites “Vector and Embedding Weaknesses” (LLM08) as a top problem. To quote from their description:

“Vectors and embeddings vulnerabilities present significant security risks in systems utilizing Retrieval Augmented Generation (RAG) with Large Language Models (LLMs). Weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited by malicious actions (intentional or unintentional) to inject harmful content, manipulate model outputs, or access sensitive information.”

This lumps a few types of attacks together, but #3 in their “Common Examples of Risks” is “Embedding Inversion Attacks.”

Staying ahead of attacks on AI systems means covering the top attacks against these systems.

How do you secure vector embeddings?

Embeddings can be secured with property-preserving application-layer encryption

Embeddings may represent all kinds of private data, from facial recognition to voice recognition to confidential text, images, and more. The best way to secure embeddings is application-layer encryption (ALE), which means you encrypt the data before sending it to a vector database like Pinecone, Qdrant, or Weaviate, or to an index file built with something like FAISS.

In the case of a database or an index file, one option could be to encrypt the file storage at an infrastructure level, but this would not protect the data on a running server.

With ALE, even if someone gains access to the stored data on a running server or gains access to database credentials, the data is senseless to them unless they also have the key.

Property-preserving encryption

If the data were randomly encrypted, it would be well protected, but you’d have to decrypt it before doing anything with it. For example, a nearest neighbor search would require downloading all stored vectors, decrypting them, and then executing the search.
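
Here is a sketch of that naive approach using Fernet (randomized authenticated encryption) from the Python cryptography package. Because the ciphertexts preserve nothing about the vectors’ geometry, every query forces a full download and decryption:

```python
# Why randomized encryption breaks search: ciphertexts preserve nothing
# about vector geometry, so every vector must be downloaded and decrypted
# before a nearest-neighbor scan. Assumes `pip install cryptography`.
import numpy as np
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())
vectors = np.random.default_rng(2).normal(size=(100, 384)).astype(np.float32)

# "Store": serialize and randomly encrypt each vector.
stored = [fernet.encrypt(v.tobytes()) for v in vectors]

# "Search": download everything, decrypt, then scan -- O(n) transfer and
# decryption work on every query, which defeats the point of a vector DB.
query = vectors[5]
decrypted = np.stack([np.frombuffer(fernet.decrypt(c), dtype=np.float32)
                      for c in stored])
best = int(np.argmin(np.linalg.norm(decrypted - query, axis=1)))
print(best)  # 5
```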

Using property-preserving encryption, the embedding vectors can be scrambled while retaining some of their structure. The vectors can’t be reversed back to their inputs (or roughly equivalent values), but they can still be queried using operations like approximate k-nearest-neighbor (kNN) search and k-means clustering.

Data-in-use protection

Only someone with the encryption key can generate an encrypted query that will meaningfully match against the encrypted data. The key is also used to decrypt the returned results, if desired, though decryption of vectors is rarely needed.
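
As a toy illustration of the idea (not the production construction; see the DCPE paper below for the real scheme), a secret random rotation plus a scale factor and a little bounded noise keeps relative distances roughly intact, so an encrypted query still lands on the right encrypted neighbors:

```python
# Toy illustration of distance-comparison-preserving encryption: a secret
# rotation and scale with small noise keep relative distances roughly
# intact, so encrypted queries match encrypted data. This is NOT the
# production DCPE scheme -- see the paper cited below for the real one.
import numpy as np

dim = 384
rng = np.random.default_rng(3)

# Secret key: a random orthogonal rotation and a scale factor.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
scale = 7.3

def encrypt(v: np.ndarray, noise: float = 0.01) -> np.ndarray:
    return scale * (rotation @ v) + rng.normal(scale=noise, size=dim)

vectors = rng.normal(size=(100, dim))
enc_store = np.stack([encrypt(v) for v in vectors])

# Only someone with the key can form a query that matches meaningfully.
enc_query = encrypt(vectors[9])
best = int(np.argmin(np.linalg.norm(enc_store - enc_query, axis=1)))
print(best)  # 9: nearest neighbors line up under encryption
```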

And the company, server, service, and staff entrusted with the data can do their jobs without adding security and privacy risk to the stored data. Any infrastructure can be used.

Drawbacks of property-preserving encryption

Property-preserving encryption is not perfect. It can leak some information. For example, an attacker with access to the encrypted embeddings could see that vector x is similar to vector y, but importantly, the attacker couldn’t say what x or y correlates to without seeing the unencrypted input or without having the key.

Benefits of property-preserving encryption

Few companies host all of their own infrastructure anymore, and even for those that do, stolen credentials and misconfigurations remain an issue (the top two causes of breaches, in fact). Encrypting embeddings reduces the risk of data being breached through application vulnerabilities and other security issues. Perhaps more importantly, companies can meet the demands of privacy and data protection laws, and of the still-developing laws governing AI data, to the extent those new laws require companies to secure the AI data they hold. It also opens up options for where data is held, with whom, and how many people within a company have access to the storage.

Open source distance-comparison-preserving encryption (DCPE)

Our approach is based on the paper Approximate Distance-Comparison-Preserving Symmetric Encryption by Fuchsbauer, Ghosal, Hauke, and O’Neill. The code is entirely open source under the AGPL license.

Below is a talk given at DEF CON 32 explaining the algorithm in more detail.

DEF CON 32 talk: “Attacks on GenAI data & using vector encryption to stop them” with Patrick and Bob

More embedding myths
Protect AI data with Cloaked AI