Security of AI embeddings explained

Vector embeddings produced by machine learning models are a prime target for data theft. They're the memory of AI and just as sensitive as the data they derive from. Encrypting the embeddings you store in vector databases is critical to protecting your company's sensitive data and reputation.

What are AI embeddings?

Embeddings are sometimes called the memory of AI and they are critical for many AI tasks

In the context of machine learning and AI systems, an embedding is an internal representation of the model's understanding of its inputs. Embeddings are represented as vectors (arrays of numbers) and are sometimes called vector embeddings or ML embeddings.

[Figure: Model: the knowledge base and decision center of AI. Vector embeddings: the memory of AI, which remembers anything evaluated by the model.]

Why do AI systems need vector embeddings?

Embeddings enable semantic search, image search, facial recognition, and much more

Embeddings are often used internally by AI systems, but sometimes they are the explicit output of a model, used to make a system more intelligent and capable while reducing side effects like hallucinations. In these cases, you ask a model to evaluate some input, such as text, an image, or audio, and it returns everything it understands about that data in the form of an embedding.

The returned embeddings can then be used in a number of advanced ways including:

  • Similarity searches
    • Facial recognition
    • Voice identification
    • Similar image search
    • Semantic search (search on meaning instead of keyword)
  • Recommendation engines (for products, people, groups, content, etc.)
  • GenAI
    • Chat dialogs to remember discussion history
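
To make the similarity-search use case concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and their labels are made up for illustration (real models emit hundreds or thousands of dimensions), and cosine similarity is assumed as the comparison metric; a production system would typically delegate this to a vector database or index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored, k=1):
    """Return the names of the k stored embeddings most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(query, stored[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented for this example.
stored = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.7, 0.6, 0.1],
    "invoice scan": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a new input, as a model might return it.
query = [0.85, 0.2, 0.05]

print(nearest(query, stored, k=2))  # ['cat photo', 'dog photo']
```

The same ranking step underlies facial recognition, similar image search, and semantic search; only the model that produces the vectors changes.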

What are the privacy risks with vector embeddings?

Vector embeddings are a gold mine of private information

[Figure: Data (text, image, audio, etc.) passes through a model to produce a vector embedding; the model plus the vector embedding is equivalent to the original data.]

Embeddings are a machine representation of arbitrary data. The better the model, the higher the fidelity of the embedding. Much like humans processing and remembering audible and visual signals and reducing them to an understanding of what’s important in them, an AI model takes similar inputs and reduces them into meaningful memories stored as vector embeddings.

For example, with large language models (LLMs), the original text can be recreated with great accuracy from the embeddings. The AI system could take a document holding secret corporate strategy and answer questions about it or even recreate it based on the embeddings it has. The recreated document wouldn’t use the same words or phrasings, but it would produce content that is entirely equivalent to the input.

Consequently, if the data being ingested by the AI system is in some way sensitive and in need of protection, then any derived embeddings will also need protection and special handling.

Security and privacy concerns are the top barriers to adoption of AI, and for good reason. Both benign and malicious actors can threaten the performance, fairness, security and privacy of AI models and data.
─ Gartner

How do you secure vector embeddings?

Embeddings can be secured with property-preserving application-layer encryption

Embeddings may represent all kinds of private data from facial recognition to voice recognition to confidential text, images, and more. The best way to secure embeddings is using application-layer encryption (ALE), which means you encrypt the data before sending it to a vector database like Pinecone/Qdrant/Weaviate or to an index file using something like FAISS.

In the case of a database or an index file, one option could be to encrypt the file storage at an infrastructure level, but this would not protect the data on a running server.

With ALE, even if someone gains access to the stored data on a running server or gains access to database credentials, the data is meaningless to them unless they also have the key.

Property-preserving encryption

If the data were encrypted with standard randomized encryption, it would be well protected, but you'd have to decrypt it before doing anything with it. For example, a nearest neighbor search would require downloading all stored vectors, decrypting them, and then executing the search locally.

Using property-preserving encryption, the embedding vectors can be scrambled while retaining some of their structure. The encrypted vectors can't be reversed back to their inputs (or to roughly equivalent values), yet they can still be queried using operations like k-nearest neighbor (kNN) search, approximate nearest neighbor search, and k-means clustering.
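
The idea can be illustrated with a deliberately simple toy, which is an assumption made for this example and not the scheme used by any real product. The hypothetical `toy_scramble` function below permutes coordinates, flips signs, and scales by a secret factor, so every distance changes by the same factor and nearest-neighbor rankings survive. This toy is trivially breakable and keeps all of the distance structure rather than some of it; it only shows why a search can still work on scrambled vectors.

```python
import math
import random

def make_toy_key(dim, seed):
    """Build a toy secret: a coordinate permutation, sign flips, and a scale factor."""
    rng = random.Random(seed)  # seeded PRNG for repeatability; NOT a secure key source
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    scale = rng.uniform(2.0, 10.0)
    return perm, signs, scale

def toy_scramble(vec, key):
    """Shuffle, sign-flip, and scale the coordinates. All distances grow by the same factor."""
    perm, signs, scale = key
    return [scale * signs[i] * vec[perm[i]] for i in range(len(vec))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, stored):
    """Name of the stored vector closest to the query."""
    return min(stored, key=lambda name: euclidean(query, stored[name]))

# Hypothetical 4-dimensional embeddings, invented for this example.
plain = {
    "contract": [0.9, 0.1, 0.0, 0.2],
    "resume": [0.1, 0.8, 0.3, 0.0],
    "roadmap": [0.0, 0.2, 0.9, 0.4],
}
key = make_toy_key(dim=4, seed=7)

# What the vector database stores: scrambled vectors only.
stored = {name: toy_scramble(vec, key) for name, vec in plain.items()}

# The client scrambles its query with the same key before searching.
query = [0.8, 0.2, 0.1, 0.1]
match = nearest(toy_scramble(query, key), stored)
print(match)  # contract, the same answer a search over the plaintext vectors gives
```

The server only ever sees scrambled vectors and a scrambled query, yet it returns the same nearest neighbor as a plaintext search would.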

Data-in-use protection

Only someone with the encryption key can generate an encrypted query that will meaningfully match against the encrypted data. The key is also used to decrypt the returned results.

And the company, server, service, and staff entrusted with the data can do their jobs without adding security and privacy risk to the stored data. Any infrastructure can be used.

Drawbacks of property-preserving encryption

Property-preserving encryption is not perfect. It can leak some information. For example, an attacker with access to the encrypted embeddings could see that vector x is similar to vector y, but importantly, the attacker couldn’t say what x or y correlates to without seeing the unencrypted input or without having the key.
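
A small sketch of that leak, using a hypothetical toy scramble (a secret coordinate permutation, sign flips, and scaling) as a stand-in for real property-preserving encryption. The key values and vectors are invented for illustration. The function `closest_pair` uses only the stored vectors, so it represents what an attacker without the key could compute.

```python
import math
from itertools import combinations

# Hypothetical toy key, known only to the data owner.
PERM, SIGNS, SCALE = [2, 0, 3, 1], [1, -1, -1, 1], 3.0

def toy_scramble(vec):
    """Toy stand-in for property-preserving encryption; not secure."""
    return [SCALE * SIGNS[i] * vec[PERM[i]] for i in range(len(vec))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_pair(vectors):
    """Find the two most similar vectors. Needs no key, only the stored values."""
    pairs = combinations(sorted(vectors), 2)
    return min(pairs, key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))

# Hypothetical plaintext embeddings held by the data owner.
plain = {
    "x": [1.0, 0.0, 0.0, 0.0],
    "y": [0.9, 0.1, 0.0, 0.0],
    "z": [0.0, 0.0, 1.0, 0.0],
}

# What an attacker with database access would see.
encrypted = {name: toy_scramble(vec) for name, vec in plain.items()}

leaked = closest_pair(encrypted)
print(leaked)  # ('x', 'y')
```

The attacker learns that x and y are similar, which is the leak described above, but the scrambled values alone do not say what either vector represents.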

Benefits of property-preserving encryption

Few companies host all of their own infrastructure anymore, and even for those that do, stolen credentials and misconfigurations are still an issue. Encrypting embeddings reduces the risk of a data breach in the face of application vulnerabilities and other security issues. Perhaps more importantly, it helps companies meet the demands of privacy and data protection laws, as well as the still-developing laws governing AI data, to the extent those new laws require companies to secure the AI data they hold. It also opens up options for where data is held, with whom, and even how many people within a company can have access to the storage.