Use cases for vector databases and Cloaked AI

Vector databases are overflowing with sensitive data. Here are 6 AI techniques where sensitive data is typically stored in the vector database as vector embeddings or as metadata.

1. Recommendation systems

What are recommendation systems?

AI recommendation systems suggest similar or related items based on insights from data sets or histories. Traditionally used to enhance consumer shopping experiences or to provide food and media recommendations (what you see in your Netflix app), recommendation systems are now used in apps for lawyers, financial managers, and even healthcare workers.

What are the privacy and security risks with recommendation systems?

Recommendation systems often rely on vector databases to identify similar foods, media, or consumer products with approximate nearest neighbor (ANN) searches. Vector databases are good at measuring degrees of similarity between items, and that similarity measure is what powers recommendation systems. However, personalized recommendation systems that use sensitive data to provide relevant recommendations risk exposing that data to anyone who gains access to the vector database.
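
To make the mechanics concrete, here's a minimal sketch of similarity-based recommendation in Python. It uses brute-force cosine similarity over synthetic embeddings; a real system would use an ANN index in a vector database, but the math is the same.

```python
# Minimal sketch: brute-force nearest neighbors over item embeddings.
# Real systems use ANN indexes (e.g., HNSW) for scale; the math is the same.
import numpy as np

def top_k_similar(query_vec: np.ndarray, item_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k items most similar to query_vec by cosine similarity."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    return np.argsort(scores)[::-1][:k]

# Example: 1,000 items with 384-dimensional embeddings.
rng = np.random.default_rng(42)
item_vecs = rng.normal(size=(1000, 384))
user_taste = item_vecs[7] + 0.1 * rng.normal(size=384)  # a user who liked item 7
print(top_k_similar(user_taste, item_vecs))  # item 7 should rank at or near the top
```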

How do I protect sensitive data in a recommendation system?

If you're using a vector database to store and index sensitive data for a recommendation system, then you need to evaluate the risk associated with the data and the level of protection it warrants. If it involves personally identifiable information, you may need to set up retention policies and have ways to erase specific data related to an individual if you get an erasure request from them. Just because vectors look like meaningless numbers to us doesn't mean they're meaningless: hackers use embedding inversion attacks to extract the original source inputs from vector embeddings. If you have sensitive metadata associated with the vectors, that becomes an issue as well. To protect the vectors and metadata, encrypt them before they're stored with searchable, drop-in, application-layer encryption.
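
To see why distance-based search can survive encryption, here's a toy illustration. This is not a secure scheme and not how Cloaked AI actually works; it only shows the shape of the idea: transforming vectors with a secret, key-derived rotation preserves distances, so nearest neighbor search still behaves correctly for anyone holding the key. Production schemes layer real cryptographic protections on top of this basic concept.

```python
# Toy illustration only -- NOT a secure scheme. It shows the core idea:
# transform vectors with a secret key so distances are preserved for key holders.
import numpy as np

def keyed_rotation(key_seed: int, dim: int) -> np.ndarray:
    """Derive a random orthogonal matrix from a secret seed (the toy 'key')."""
    rng = np.random.default_rng(key_seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def protect(vec: np.ndarray, rot: np.ndarray) -> np.ndarray:
    # Rotations preserve distances, so nearest-neighbor search still works on
    # protected vectors -- but only against others protected with the same key.
    return rot @ vec

rot = keyed_rotation(key_seed=1234, dim=4)
a, b = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
assert np.isclose(np.linalg.norm(a - b),
                  np.linalg.norm(protect(a, rot) - protect(b, rot)))
print("distance preserved under the keyed transform")
```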

To learn more, visit Cloaked AI.

2. Retrieval Augmented Generation (RAG)

What is retrieval augmented generation (RAG)?

RAG is a pattern that allows a general-purpose AI model to answer questions based on data it wasn't trained on, which may be private or sensitive. RAG is typically used to build question-and-answer support over knowledge bases or pools of private data. This is most commonly achieved by putting the sensitive data into vector databases, using those databases to find material relevant to a given query, then providing that material as context to the model so it can answer the question. The same technique is used to keep AI systems from "hallucinating" plausible but false answers, an application commonly called grounding.
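
Here's a hedged sketch of that flow in Python; embed, vector_search, and llm are hypothetical stand-ins for your embedding model, vector database client, and chat model.

```python
# Minimal RAG sketch. embed(), vector_search(), and llm() are hypothetical
# stand-ins for an embedding model, a vector database, and a chat model.
def answer_with_rag(question: str, embed, vector_search, llm, k: int = 4) -> str:
    query_vec = embed(question)                 # 1. embed the question
    chunks = vector_search(query_vec, top_k=k)  # 2. retrieve relevant text chunks
    context = "\n\n".join(chunks)               # 3. build the grounding context
    prompt = (
        "Answer using ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                          # 4. generate a grounded answer
```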

What are the privacy and security risks with retrieval augmented generation (RAG)?

Companies that use RAG to add context and ground AI systems are often storing confidential data in the vector database. This data could be regulated or sensitive and could represent a large risk to the company if it were compromised. In personalized RAG workflows, the chatbot might answer questions about a person's medical or financial history, for instance, requiring it to have access to extremely sensitive data. Unfortunately, this data is vulnerable even when it's represented as a vector in a vector database, and in many cases the vectors have associated metadata that can also contain private information viewable by hosting providers, hackers, governments, and others.

How do I protect sensitive data in RAG?

Private data used to ground AI systems needs to be protected at the source and in any derivative copies, such as the associated vectors in a vector database. Additionally, the data should be labeled and evaluated to understand whether it needs to meet specific data retention policies or is subject to erasure requests. Identify what data is being used in the RAG workflow, including any vectors and metadata. If any of that data carries a high risk should it become public, you need to add data protection controls. The best way to protect the data, whether it's on-prem or hosted, is to encrypt it before it's stored with searchable, drop-in, application-layer encryption.
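
As a concrete example of supporting erasure requests, here's a hypothetical sketch; the index object and its methods stand in for whatever vector database client you use, not a real API.

```python
# Hedged sketch: honoring an erasure request in a RAG pipeline. The `index`
# object and its methods are hypothetical stand-ins for a vector DB client.
def erase_user_data(index, user_id: str) -> None:
    # Vectors must be stored with metadata that ties them back to a person,
    # or there is no reliable way to find them later for deletion.
    matches = index.query_by_metadata({"user_id": user_id})
    index.delete(ids=[m.id for m in matches])
    # Don't forget derivative copies: caches, backups, and the source documents
    # the vectors were built from all need the same treatment.
```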

To learn more, view our in-depth guide on Security Risks in RAG Architectures.

3. Biometric systems

What are biometric systems?

Biometric systems include face recognition, speech recognition, fingerprint recognition, iris recognition, author recognition (based on writing style), and behavior recognition. Because these signals inherently vary from one capture to the next, at least at the sensor level, exact matching doesn't work, and a vector database is a common choice for storing them. Two faces are considered the same if their embeddings are sufficiently similar, and a search can find all similar faces in a repository. The same is true for voiceprints, fingerprints, and so on.
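
Here's a small sketch of that matching logic using synthetic embeddings; the threshold and dimensions are illustrative, not taken from any real face model.

```python
# Sketch of biometric matching: two captures count as the same person when
# their embeddings are closer than a tuned threshold. Values are illustrative.
import numpy as np

def same_person(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.6) -> bool:
    """Compare two biometric embeddings by cosine distance."""
    cos_sim = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return (1.0 - cos_sim) < threshold

# Enrollment stores an embedding; verification embeds a new capture and compares.
enrolled = np.random.default_rng(0).normal(size=512)
new_capture = enrolled + 0.05 * np.random.default_rng(1).normal(size=512)  # sensor noise
print(same_person(enrolled, new_capture))  # True: same face, slightly different capture
```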

What are the privacy and security risks with biometric systems?

Anything that can identify a person is, by definition, personally identifiable information (PII), which is regulated by privacy laws around the world. Many of those laws set biometric data at a higher level of sensitivity than other identifying information like addresses because of the potential harm to end users if their data is compromised.

How do I protect sensitive data in a biometric system?

Biometric data should always be stored in an encrypted form. Infrastructure and disk-level encryption are insufficient on always-on servers because the data is effectively unprotected unless the hard drives are physically removed from the servers. Vectors representing biometrics and any associated data can be encrypted before they're stored in the vector database without losing their utility: nearest neighbor searches continue to work for anyone querying with the relevant encryption key.

To learn more, visit Cloaked AI.

4. Anomaly and fraud detection

What are anomaly detection and fraud detection AI systems?

AI brings intelligence to the task of labeling and identifying data sets and clustering those sets so that outliers, anomalies, and other patterns can be detected. Vector databases can be used to group sets of data such as known bad behaviors (fraud, etc.) and known good behaviors. The system can then examine new behaviors and see whether they're similar to good behaviors or bad behaviors, or whether they're anomalous and worthy of investigation. Vector databases have the benefit of allowing AI systems to constantly learn and evaluate, so bad behaviors can be identified and counteracted in real time.
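
Here's a minimal sketch of that classification logic with synthetic data; a real system would run the nearest-neighbor step inside a vector database rather than in memory.

```python
# Sketch: label a new behavior by its nearest labeled neighbor, and flag it
# as anomalous when it isn't close to anything known. Data is synthetic.
import numpy as np

def classify_behavior(new_vec, known_vecs, labels, anomaly_dist: float = 2.0) -> str:
    dists = np.linalg.norm(known_vecs - new_vec, axis=1)
    nearest = int(np.argmin(dists))
    if dists[nearest] > anomaly_dist:
        return "anomalous: investigate"   # unlike anything we've seen before
    return labels[nearest]                # "good" or "fraud"

rng = np.random.default_rng(7)
good = rng.normal(loc=0.0, size=(100, 8))    # cluster of known good behaviors
fraud = rng.normal(loc=5.0, size=(20, 8))    # cluster of known fraud
known = np.vstack([good, fraud])
labels = ["good"] * 100 + ["fraud"] * 20

print(classify_behavior(good[0] + 0.1, known, labels))       # -> "good"
print(classify_behavior(np.full(8, 50.0), known, labels))    # -> "anomalous: investigate"
```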

What are the privacy and security risks with anomaly detection and fraud detection systems?

Most of the time, anomaly and fraud detection systems work on user behavior signals of various kinds, which means they're effectively tracking the behaviors of both the good actors and the bad actors in the system. This sort of data is inherently sensitive: it may represent purchase histories, locations, browsing history, or many other types of behavior that are both identifiable and potentially harmful to the data subject if made public. These systems sometimes also use photos and videos, which have privacy implications of their own.

How do I protect sensitive data in an anomaly detection or fraud detection system?

Whenever you're dealing with sensitive data of this nature, you'll want to figure out retention times and methods for selective erasure. Embedding inversion attacks can turn seemingly meaningless vector data back into human-readable sensitive data. Understanding this, you should evaluate the risk the source data carries and make protection decisions based on that level of risk. The best approach is to encrypt the data before it's stored or indexed with searchable, drop-in, application-layer encryption, which allows these systems to keep functioning while protecting the data.

To learn more, visit Cloaked AI.

5. Similar image search

What is similar image search?

Also called reverse image search, similar image search powers tools like Google Images and TinEye where you upload an image, and a search is conducted for similar ones. Embedding models are used to capture meaning and information from the images in the form of vector embeddings. These embeddings are stored in vector databases, which perform nearest neighbor searches to find similar images.
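
Here's a short sketch of the idea, assuming the sentence-transformers library with a public CLIP checkpoint (the file paths are illustrative); a production system would store the library vectors in a vector database and query with ANN search.

```python
# Sketch of similar image search using a CLIP-style embedding model via the
# sentence-transformers library. Image file paths are illustrative.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # maps images into one vector space

# Embed a small library of images; production systems would upsert these
# vectors into a vector database instead of keeping them in memory.
library_files = ["cat1.jpg", "cat2.jpg", "skyline.jpg"]
library_vecs = model.encode([Image.open(f) for f in library_files])

# Embed the query image and rank the library by cosine similarity.
query_vec = model.encode(Image.open("query_cat.jpg"))
scores = util.cos_sim(query_vec, library_vecs)[0]
best = int(scores.argmax())
print(f"Most similar: {library_files[best]} (score {scores[best].item():.2f})")
```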

What are the privacy and security risks with similar image search?

If the images are already public, there is little risk. If the images may contain confidential information or are private in nature, then the vectors should be treated with the same care. Research has shown that embedding vectors can be inverted into images that are close approximations of the originals.

How do I protect sensitive data in AI systems that use similar image search?

You must protect both the source image and any linked or derivative data, such as vector embeddings and their metadata, which should be considered as sensitive as the original source data. If that data is confidential, then it warrants encrypting both the source data and the equivalent vector database data. IronCore's products can help with both, with Cloaked AI protecting the vector database data so that only people with the encryption key can make sense of the data or use it in searches.

To learn more, visit Cloaked AI.

6. Semantic text search

What is semantic text search?

Semantic search is another way of saying meaning-based search. Unlike keyword searches, which find a specific word or its synonyms, semantic search queries over concepts and meanings, so it can find matches even when different words are used. It also avoids false matches on words that have multiple meanings, some of which are unrelated to the query. This is the future of text-based search.
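
Here's a brief sketch using the sentence-transformers library; the model is a common public checkpoint and the corpus is illustrative.

```python
# Sketch of semantic text search: the query shares no keywords with the best
# match, but the embeddings capture the shared meaning.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "The defendant filed a motion to dismiss.",
    "Quarterly revenue grew 12% year over year.",
    "The patient reported chest pain after exercise.",
]
corpus_vecs = model.encode(corpus)

query_vec = model.encode("heart discomfort during a workout")
scores = util.cos_sim(query_vec, corpus_vecs)[0]
print(corpus[int(scores.argmax())])  # -> the chest pain sentence
```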

What are the privacy and security risks with semantic text search?

Sentences or other chunks of content are run through embedding models to produce vector embeddings, which are stored in vector databases and used to power semantic search. Research shows that these embeddings can be turned back into their original text, recovering 92% of the original words and 100% of the original meaning, so if the input text is confidential, private, or sensitive, then the vectors are as well.

How do I protect sensitive data in AI systems that use semantic text search?

Make sure you protect your search data to keep unauthorized parties, such as curious administrators, hackers, and even foreign governments armed with subpoenas, from gaining backdoor access to your data through your vector database. You can do this by encrypting the embeddings and any associated metadata with data-in-use application-layer encryption so that only you or your authorized applications and users can query the data. Anyone else sees only meaningless numbers.

To learn more, visit Cloaked AI.