Security of AI
Vector embeddings produced by machine learning models are a prime target for data theft. They're the memory of AI and just as sensitive as the data they are derived from. Encrypting the embeddings you store in vector databases is critical to protecting your company's data and reputation.
What are AI embeddings?
Embeddings are sometimes called the memory of AI and they are critical for many AI tasks
In the context of machine learning and AI systems, an embedding is an internal representation of the model's understanding of its inputs. Embeddings are represented as vectors (arrays of numbers) and are sometimes called vector embeddings or ML embeddings.
Why do AI systems need vector embeddings?
Embeddings enable semantic search, image search, facial recognition, and much more
Embeddings are often used internally by AI systems, but sometimes they are the explicit output of a model, used to make a larger system more intelligent and capable while reducing side effects like hallucinations. In these cases, you ask a model to evaluate some input such as text, an image, or audio, and it returns everything it understands about that data in the form of an embedding.
The returned embeddings can then be used in a number of advanced ways including:
- Similarity searches
- Facial recognition
- Voice identification
- Similar image search
- Semantic search (search on meaning instead of keyword)
- Recommendation engines (for products, people, groups, content, etc.)
- Chat dialogs to remember discussion history
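Most of the uses listed above reduce to the same operation: comparing a query embedding against stored embeddings and returning the closest ones. The following is a minimal sketch of that idea using cosine similarity. The three-dimensional vectors, the labels, and the function names are hypothetical stand-ins; real models emit hundreds or thousands of dimensions, and real vector databases use optimized indexes rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda label: cosine_similarity(query, store[label]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented for illustration only.
store = {
    "cello practice tips": [0.9, 0.1, 0.0],
    "violin lessons": [0.8, 0.3, 0.1],
    "quarterly earnings": [0.0, 0.2, 0.9],
}
query = [0.9, 0.15, 0.0]  # stand-in for the embedding of a music-related query
top = nearest(query, store)
```

Here the two music-related entries rank ahead of the unrelated one, which is the behavior that semantic search and recommendation engines build on.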
What are the privacy risks with vector embeddings?
Vector embeddings are a gold mine of private information
Embeddings are a machine representation of arbitrary data. The better the model, the higher the fidelity of the embedding. Much like humans processing and remembering audible and visual signals and reducing them to an understanding of what’s important in them, an AI model takes similar inputs and reduces them into meaningful memories stored as vector embeddings.
Embedding inversion attacks
Just as you can extract training data back out of models in model inversion attacks, a number of academic papers have recently demonstrated you can do the same thing using embedding inversion attacks on vector embeddings. These take embeddings from a vector of numbers back to the original input or an approximation thereof.
In the paper with the best results so far, attackers were able to recover the exact inputs in 92% of cases, including full names and health diagnoses. In the remaining 8%, the recovered data was largely the same as the original input. In another paper, this one accompanied by open source software that allows anyone to reproduce the attack, the inversions largely succeeded in recovering every theme of the original input. For example, where the original text was “I love playing the cello! It relaxes me!”, one demonstration recovered “I love playing the cello! It helps me relax!” It isn’t a perfect reproduction of the input, but it’s close enough for most purposes.
Membership inference attacks
Also of concern is the ability of an attacker to find out whether some input is stored in a vector database. If each vector represents a sentence, an attacker might search for “In 23Q3, the company exceeded expectations” as a common phrase used when a company does well and reports on it. With access to a vector database, they could test for that and other expected sentences to see what exists in the system. Or they could test whether specific names, locations, faces, or other data exist in the data set.
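The membership test described above can be sketched in a few lines, assuming the attacker can read the stored vectors and can run the same embedding model. Everything named here is hypothetical: `toy_embed` is a hashed bag-of-words stand-in for a real model, and the threshold is an arbitrary illustrative value.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding width

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def best_match(candidate, stored_vectors):
    """Highest cosine similarity between the candidate and any stored vector."""
    probe = toy_embed(candidate)
    return max(sum(p * s for p, s in zip(probe, vec)) for vec in stored_vectors)

# What the attacker can read: vectors only, with no source text attached.
stolen_vectors = [
    toy_embed("In 23Q3, the company exceeded expectations"),
    toy_embed("The patient was diagnosed with type 2 diabetes"),
]

def was_stored(candidate, threshold=0.99):
    """Guess that the candidate is in the data set if a near-identical vector exists."""
    return best_match(candidate, stolen_vectors) >= threshold
```

Calling `was_stored("In 23Q3, the company exceeded expectations")` returns `True` in this sketch, while an unrelated sentence falls well below the threshold. No inversion is needed; guessing likely inputs is enough.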
In short, embeddings are equivalent to their inputs and are just as sensitive as any data that was used to create them.
Security and privacy concerns are the top barriers to adoption of AI, and for good reason. Both benign and malicious actors can threaten the performance, fairness, security and privacy of AI models and data.
How do you secure vector embeddings?
Embeddings can be secured with property-preserving application-layer encryption
Embeddings may represent all kinds of private data from facial recognition to voice recognition to confidential text, images, and more. The best way to secure embeddings is using application-layer encryption (ALE), which means you encrypt the data before sending it to a vector database like Pinecone/Qdrant/Weaviate or to an index file using something like FAISS.
In the case of a database or an index file, one option could be to encrypt the file storage at an infrastructure level, but this would not protect the data on a running server.
With ALE, even if someone gains access to the stored data on a running server or obtains database credentials, the data is meaningless to them unless they also have the key.
If the data were encrypted with standard randomized encryption, it would be well protected, but you’d have to decrypt it before doing anything with it. For example, a nearest neighbor search would require downloading all stored vectors, decrypting them, and then executing the search.
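To make that cost concrete, here is a rough sketch of a search over conventionally encrypted vectors. The cipher below is a toy keystream built from SHA-256, used only so the example runs with the standard library; it is an assumption for illustration, not vetted cryptography, and a real system would use an authenticated cipher such as AES-GCM. All function names are hypothetical.

```python
import hashlib
import math
import os
import struct

def _keystream(key, nonce, length):
    """Toy keystream (SHA-256 in counter mode). Illustration only, not vetted crypto."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_vector(key, vec):
    """Randomized encryption: a fresh nonce makes every ciphertext look unrelated."""
    nonce = os.urandom(16)
    plain = struct.pack(f"<{len(vec)}d", *vec)
    cipher = bytes(p ^ k for p, k in zip(plain, _keystream(key, nonce, len(plain))))
    return nonce + cipher

def decrypt_vector(key, blob):
    """Reverse encrypt_vector using the nonce stored in the first 16 bytes."""
    nonce, cipher = blob[:16], blob[16:]
    plain = bytes(c ^ k for c, k in zip(cipher, _keystream(key, nonce, len(cipher))))
    return list(struct.unpack(f"<{len(plain) // 8}d", plain))

def search_encrypted(key, query, blobs):
    """The server cannot rank ciphertexts, so the client decrypts every stored vector."""
    vectors = [decrypt_vector(key, b) for b in blobs]  # one decryption per stored item
    return min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))

key = os.urandom(32)
stored = [
    encrypt_vector(key, v)
    for v in ([0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.5, 0.5, 0.5])
]
best = search_encrypted(key, [0.1, 0.2, 0.8], stored)  # index of the nearest vector
```

The search works, but every query touches every stored vector on the client side, which does not scale to a vector database of any real size.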
Using property-preserving encryption, the embedding vectors can be scrambled while retaining some of their structure. The encrypted vectors can’t be reversed back to their inputs (or roughly equivalent values), but they can still be queried using operations like k-nearest neighbor (kNN) search, approximate nearest neighbor search, and k-means clustering.
Only someone with the encryption key can generate an encrypted query that will meaningfully match against the encrypted data. The key is also used to decrypt the returned results.
And the company, server, service, and staff entrusted with the data can do their jobs without adding security and privacy risk to the stored data. Any infrastructure can be used.
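As a toy illustration of the property-preserving idea, the sketch below derives a secret rotation from a key and applies it to every vector and every query. A rotation keeps distances between vectors intact, so nearest neighbor search still works on the transformed data. This is an assumption chosen for simplicity, not the scheme any particular product uses, and a bare rotation is not secure on its own (an attacker with a few known input and output pairs could recover it). The key, vectors, and function names are all hypothetical.

```python
import math
import random

def secret_rotation(key, dim):
    """Derive an orthogonal matrix from the key via Gram-Schmidt on seeded Gaussian rows."""
    rng = random.Random(key)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along rows already chosen
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def encrypt(matrix, vec):
    """Apply the secret rotation; distances between vectors are preserved."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def decrypt(matrix, vec):
    """Apply the transpose, which undoes an orthogonal rotation."""
    dim = len(vec)
    return [sum(matrix[i][j] * vec[i] for i in range(dim)) for j in range(dim)]

def nearest_index(query, vectors):
    """Index of the stored vector closest to the query (Euclidean distance)."""
    return min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))

key = b"hypothetical-secret-key"
vectors = [[0.9, 0.1, 0.0, 0.2], [0.0, 0.2, 0.9, 0.1], [0.5, 0.5, 0.5, 0.5]]
query = [0.1, 0.2, 0.8, 0.0]

rotation = secret_rotation(key, dim=4)
stored = [encrypt(rotation, v) for v in vectors]  # what the vector database holds
enc_query = encrypt(rotation, query)              # requires the key to construct
match = nearest_index(enc_query, stored)          # search runs on encrypted data
recovered = decrypt(rotation, stored[match])      # key holder recovers the result
```

The search over `stored` selects the same item that a search over the plaintext `vectors` would, yet the database only ever sees rotated values. A query rotated with a different key lands somewhere unrelated and does not match meaningfully.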
Drawbacks of property-preserving encryption
Property-preserving encryption is not perfect. It can leak some information. For example, an attacker with access to the encrypted embeddings could see that vector x is similar to vector y, but importantly, the attacker couldn’t say what y correlates to without seeing the unencrypted input or without having the key.
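The leak can be seen in a small sketch. It assumes a toy distance-preserving transform (a key-derived shuffle of coordinates plus sign flips), which is a hypothetical stand-in and not a secure or real-world scheme. The attacker's code at the bottom uses only the encrypted vectors and never touches the key.

```python
import math
import random

def toy_encrypt(key, vec):
    """Toy distance-preserving transform: key-derived shuffle plus sign flips."""
    rng = random.Random(key)
    order = list(range(len(vec)))
    rng.shuffle(order)
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return [signs[i] * vec[j] for i, j in enumerate(order)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

# Hypothetical plaintext embeddings that the attacker never sees.
x = [0.9, 0.1, 0.0, 0.2]
y = [0.8, 0.2, 0.1, 0.2]
z = [0.0, 0.1, 0.9, 0.3]
key = b"hypothetical-secret-key"
enc_x, enc_y, enc_z = (toy_encrypt(key, v) for v in (x, y, z))

# Attacker's view: ciphertexts only, no key. Relative similarity is still visible.
sim_xy = cosine(enc_x, enc_y)
sim_xz = cosine(enc_x, enc_z)
```

In this sketch `sim_xy` is far higher than `sim_xz`, so the attacker learns that x and y are related while z is not. What the attacker cannot do from the ciphertexts alone is say what any of the three vectors represents.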
Benefits of property-preserving encryption
Few companies host all of their own infrastructure anymore, and even for those that do, stolen credentials and misconfigurations are still an issue. Encrypting embeddings reduces the risk of a data breach in the face of application vulnerabilities and other security issues. Perhaps more importantly, companies can meet the demands of privacy laws, data protection laws, and the still-developing laws governing AI data, to the extent those new laws require companies to secure the AI data they hold. It also opens up options for where data is held, with whom, and even how many people within a company can have access to the storage.