Patrick Walsh
Originally published at www.forbes.com.

Forbes: AI Systems And Vector Databases Are Generating New Privacy Risks

The proliferation of general-purpose AI models is turning machine learning projects on their heads and changing where the risk in AI projects comes from. Work that once required painstaking assembly of training data and the building, testing and refining of custom models can now be done far faster and more easily by building on shared models, such as the large language and vision models whose popularity has soared this year.

The widespread adoption of these shared models is dramatically changing where sensitive data lives. It is shifting out of the model and into “the memory of AI,” otherwise known as vector embeddings.

Look To The Memory, Not The Model

What are vector embeddings? Embeddings are sequences of numbers that represent a model’s impression of some input. They can represent a face, a sentence, an image, a voice and much more. Embeddings are stored in vector databases, which are seeing a rush of investment and adoption across cloud infrastructures this year.
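
To make this concrete, here is a minimal sketch of generating a text embedding. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model; both are just one common choice, and any embedding model produces the same kind of output.

```python
# Minimal sketch: turn a sentence into an embedding vector.
# Assumes the sentence-transformers library and the "all-MiniLM-L6-v2"
# model -- one common choice among many.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("Patient presents with chest pain and shortness of breath.")

print(vec.shape)  # (384,): a fixed-length sequence of floats
print(vec[:5])    # e.g. [-0.03  0.11 ...]: opaque to a human reader
```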

AI systems use embeddings to capture the meaning of an input and then compare inputs mathematically. For example, you can capture the meaning of a sentence as an embedding and then find all other sentences with similar meanings by searching for nearby vectors. Instead of matching keywords, this approach finds similar content regardless of the exact words used, and it can differentiate between “Alice likes Bob” and “Bob likes Alice.” This is called semantic search, and it also disambiguates words with multiple meanings, like the difference between “tactics to build strong arms” and “strong-arm tactics,” which share words but are only distantly related at a semantic level.
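
A short sketch shows the mechanics, again assuming sentence-transformers; a production system would use a vector database's nearest-neighbor index rather than a brute-force dot product.

```python
# Sketch of semantic search: rank stored sentences by cosine similarity
# to a query. With normalized embeddings, cosine similarity is a dot product.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "tactics to build strong arms",
    "strong-arm tactics",
    "exercises for bigger biceps",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode("arm workout routine", normalize_embeddings=True)

scores = corpus_vecs @ query_vec  # cosine similarity to each stored sentence
for sentence, score in sorted(zip(corpus, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {sentence}")
# The exercise sentences should rank above "strong-arm tactics" despite the
# shared words -- keyword search would get this wrong.
```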

Vector Embeddings Are Rich With Sensitive Data And Are A Hacker’s Dream

To us, embeddings look like a bunch of meaningless numbers. There’s no obvious name or birthdate visible in the vector, which leads many to assume that embeddings are innocuous data. But that’s a misconception.

Embedding inversion attacks pull rich data back out of embeddings, including people’s names, medical diagnoses and much more. Original sentences can be recreated in theme, if not in exact words. There’s already open-source software available to conduct these attacks.
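
The mechanics are worth seeing. The toy sketch below is not a real attack; published attacks on text embeddings train a decoder, and that is what the open-source tooling does. But it shows the core idea: given an embedding and query access to the encoder, an attacker can search for an input that reproduces it.

```python
# Toy illustration of embedding inversion, not a real attack: given a target
# embedding and query access to a (here, stand-in) frozen encoder, optimize
# an input until its embedding matches the target.
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)        # stand-in for a frozen embedding model
secret = torch.randn(8)                # the private original input
target = encoder(secret).detach()      # the embedding the attacker obtains

guess = torch.zeros(8, requires_grad=True)
opt = torch.optim.Adam([guess], lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(encoder(guess), target)
    loss.backward()
    opt.step()

# The guess's embedding now matches the target almost exactly; with richer
# models and trained decoders, the recovered input approximates the original.
print(float(torch.nn.functional.mse_loss(encoder(guess), target)))  # ~0.0
```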

Embeddings that represent faces can similarly be reversed using techniques that generate facial images until one is recognized as the same as the original. Any embedding is subject to attacks that can reverse it back to something equivalent to the original input.

And that’s not the only type of attack against these embeddings. Membership inference attacks are also standard fare. This type of attack has mostly been used against models to determine whether certain data was in an original training set, but the same type of attack can be used on embeddings to determine whether certain content is held by a vector database.
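
A sketch of the vector-database variant: re-embed the suspected content and check whether a near-duplicate already sits in the index. The vectors below are random stand-ins and the threshold is illustrative, not calibrated.

```python
# Sketch of membership inference against a vector store: if the nearest
# stored vector is a near-exact match for a candidate's embedding, the
# candidate content is almost certainly in the database.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 384))   # stand-in vector database
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

candidate = stored[42].copy()           # attacker re-embeds suspected content

sims = stored @ candidate               # cosine similarity to every vector
if sims.max() > 0.99:                   # illustrative, uncalibrated threshold
    print("suspected content is almost certainly in the store")
```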

Sometimes, an attacker won’t even have to use fancy attacks to get at the original data because the vectors often have associated data and metadata that ride along with them. The attached data often includes private information or even the original raw input used to generate the embedding.
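
A hypothetical record shape illustrates the problem; field names vary by vendor, but the pattern of stashing the raw source text in metadata is common.

```python
# Hypothetical vector-database record (field names vary by vendor).
# No inversion attack is needed when the raw input rides along in metadata.
record = {
    "id": "patient-note-8841",
    "values": [0.021, -0.118, 0.064],  # embedding, truncated for illustration
    "metadata": {
        "patient_name": "Jane Doe",    # PII stored alongside in plaintext
        "source_text": "Jane Doe, DOB 1981-04-02, presents with ...",
    },
}
```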

Rapid Progress Creates Tension With Privacy Regulations

Developers are moving fast to adopt new technologies that make their applications more powerful, their businesses more intelligent and their jobs easier. But AI systems that rely on vector databases are introducing risks faster than defenders can respond, leaving security, privacy and compliance teams scrambling to catch up.

Regulations like the EU’s flagship privacy law, GDPR, require data protection by design and by default. That requirement applies above all to new features and functionality, which raises the question: When building new GenAI capabilities, what should companies do to mitigate the risks, protect the new data and data stores, and stay ahead of current and burgeoning regulations?

Treat Embeddings Like The Source Data

For compliance teams, the first rule is to ask detailed questions. Find out what data is being stored in vector databases and what data those vectors were derived from. If the source data involves personally identifiable information or other sensitive data, then the vector embeddings should be treated as if they are that source data. That means tracking against retention policies, being able to handle erasure requests and protecting sensitive data at the highest practical level.
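
As a sketch of what treating embeddings like source data means in practice, consider an erasure request. The in-memory store and the subject_id field below are hypothetical; the point is that per-subject deletion is only possible if provenance is recorded when vectors are written.

```python
# Sketch of honoring a GDPR Article 17 (right to erasure) request against a
# hypothetical in-memory vector store. The "subject_id" provenance field is
# an assumption -- it must be recorded at write time for this to work at all.
records = [
    {"id": "v1", "vector": [0.1, 0.2], "metadata": {"subject_id": "user-17"}},
    {"id": "v2", "vector": [0.3, 0.1], "metadata": {"subject_id": "user-42"}},
]

def erase_subject(store: list[dict], subject_id: str) -> list[dict]:
    """Drop every embedding derived from the given data subject's records."""
    return [r for r in store if r["metadata"]["subject_id"] != subject_id]

records = erase_subject(records, "user-17")  # erasure request from user-17
print([r["id"] for r in records])            # ['v2']
```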

Use existing compliance tools to track this new data much as you would track any other sensitive data. Be careful when relying on notions of “anonymized data” with embeddings, given the amount of sensitive data and the level of identifiability that can be baked into them. For example, if you hold facial recognition data but associate it only with an ID and not a name, is that anonymized? Not really: the original face can be recreated. Text data is much the same. Beyond capturing semantic meaning, embeddings have been used to infer sentence authorship by analyzing writing style directly from the vectors.

Protect The AI Memory

For the data stored in vector databases, a new class of solutions is emerging that focuses on protecting sensitive vector embeddings. These solutions encrypt the vectors while still allowing the nearest-neighbor queries, clustering and other operations that vector databases are optimized for. Once encrypted, only people with the proper keys can query the data or make any sense of it. This is data-in-use encryption for vectors, and it foils embedding inversion and membership inference attacks.
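
Why can search survive encryption at all? The toy below is emphatically not a secure scheme, and a fixed rotation is not encryption, but it illustrates the underlying geometric idea: a secret, distance-preserving transformation lets a server answer nearest-neighbor queries over vectors it cannot read directly. Real data-in-use encryption for vectors is far more involved.

```python
# Toy illustration (NOT a secure scheme): a secret orthogonal rotation
# preserves distances, so a server holding only rotated vectors can still
# answer nearest-neighbor queries. Real schemes add much more machinery.
import numpy as np

rng = np.random.default_rng(0)
key, _ = np.linalg.qr(rng.normal(size=(384, 384)))  # secret orthogonal matrix

vectors = rng.normal(size=(100, 384))                  # plaintext embeddings
protected = vectors @ key                              # what the server stores

query = vectors[7] + rng.normal(scale=0.01, size=384)  # a noisy client query
protected_query = query @ key                          # client transforms, then sends

# Rotation leaves all pairwise distances unchanged, so the match survives.
dists = np.linalg.norm(protected - protected_query, axis=1)
print(dists.argmin())  # 7
```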

For security and privacy teams trying to find a balance between compliance, risk mitigation and empowering development teams to build business-transforming tools, these new data protection technologies are a gift. Developers can even be allowed to use third-party hosted solutions with private data if that data is encrypted before going to the database-as-a-service provider.

With the right policies, the right questions and the right tools, organizations can take advantage of the transformative power of generative AI while also protecting sensitive customer data.