Privacy-Preserving AI: The Secret to Unlocking Enterprise Trust
Why Fortune 500 companies are blocking vendors — and how encrypted AI models win them back
Enterprises are slowing their adoption of AI as security and legal teams start to manage some of the risk. When I was at the AI Risk Summit in August, there was a panel of CISOs talking about how they’re managing AI and what their legal teams are doing. Each had a group studying the issue and creating guidelines, but little had been enforced. The one exception: each company was aggressively blocking partners from using their private data to train models. They’d rather leave a vendor than allow that to happen.
The restriction makes sense since AI is well known for leaking data via model inversion attacks. I make this point and demo some of these attacks in my DEF CON 33 talk: Exploiting Shadow Data in AI Models and Embeddings. Model training is one of the things companies should be wary of when enabling AI capabilities, but it’s just the tip of the iceberg. For example, we’ve talked elsewhere about questions to ask your vendor to get a handle on how risky partner AI features are.
But when it comes to training models, like many things in life, there’s nuance. It isn’t necessarily risky to let a partner use your data for training. AI models can be built using privacy-preserving technologies that prevent training data from being read or data within the models from being leaked. When done correctly, a vendor can build a model for a customer without seeing their data, while still offering the supercharged productivity that can come from AI.
The Legal Perspective: Example Clauses Forbidding Private Data Use
Today, it’s common for enterprises to add clauses to their vendor contracts addressing AI. For example, this clause doesn’t allow a SaaS company to pass any confidential information along to third-party AI providers unless the customer expressly allows it:
Vendor shall not input, upload, transmit, or otherwise disclose any Confidential Information to any artificial intelligence system, large language model, machine learning tool, or similar automated system (“AI System”) unless expressly authorized in writing by [Company Name].
Or this clause, which expressly forbids training on the customer’s data:
Vendor shall ensure that any AI System used in connection with the Services: (a) does not retain, train on, or reuse [Company Name]’s Confidential Information;
But not all training or use of AI is the same, and we will propose below some alternative clauses that carve out exceptions for privacy-preserving technologies.
Technical Measures to Preserve Privacy
AI can be used in privacy-preserving ways. Models can be built from private data without leaking it. And there are several approaches to this today.
Approach 1: Tokenization/Redaction
Easily the most widely available approach uses redaction or tokenization to identify specific problematic bits of data, such as names, addresses, or health diagnoses, and then replace them with placeholders of one kind or another. If you’re looking for a solution like this, you just need to throw a rock in the general direction of security startups, and you’ll likely hit three in one throw.
This approach can be applied to both training data and model inputs before sending the data to a third party. Instead of training on data that says, “Veruca Salt was diagnosed with Acute Covetitis after screaming about wanting everything and falling down a bad egg checker garbage chute,” the model trains on something like, “Person123 has Diagnosis921 after screaming about wanting everything and falling down a bad egg checker garbage chute.”
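To make the substitution concrete, here is a minimal Python sketch of the idea. The entity lists, placeholder scheme, and `pseudonymize` helper are hypothetical stand-ins for illustration; real products use trained named-entity-recognition models rather than hand-written lists.

```python
import re

# Hypothetical entity lists: real tokenization products use trained NER
# models and far broader coverage than these hand-written examples.
KNOWN_NAMES = ["Veruca Salt"]
KNOWN_DIAGNOSES = ["Acute Covetitis"]

def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace known entities with placeholder tokens.

    Returns the redacted text plus the mapping needed to reverse it,
    which is why this is pseudonymization rather than anonymization.
    """
    mapping: dict[str, str] = {}
    for i, name in enumerate(KNOWN_NAMES, start=123):
        placeholder = f"Person{i}"
        mapping[placeholder] = name
        text = re.sub(re.escape(name), placeholder, text)
    for i, dx in enumerate(KNOWN_DIAGNOSES, start=921):
        placeholder = f"Diagnosis{i}"
        mapping[placeholder] = dx
        text = re.sub(re.escape(dx), placeholder, text)
    return text, mapping

redacted, mapping = pseudonymize(
    "Veruca Salt was diagnosed with Acute Covetitis after screaming "
    "about wanting everything."
)
print(redacted)  # Person123 was diagnosed with Diagnosis921 after ...
print(mapping)   # the reversal table someone would have to protect
```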
Unfortunately, the tokenization/redaction approach to privacy is problematic on many levels. Merely substituting words gives us pseudonymization, which is easily reversible. It’s often easy to figure out who is the subject of some text (as those who remember Willy Wonka will likely do when reading about Person123). And sensitive information doesn’t reside solely in names and addresses. Because the rest of each sentence passes through untouched, all kinds of sensitive health information, financial projections, intellectual property, private political leanings, and so on can leak.
While tokenization may seem like a great way to check some privacy boxes, it’s not a great way to protect sensitive data.
Approach 2: Encryption
Unfortunately, none of the big LLM providers currently offer a privacy-preserving way to query their large language and vision models. They could absolutely run their models in confidential compute environments using Nvidia H100 GPUs. But they don’t. This leaves anyone who doesn’t want to send confidential data to these AI companies in a bit of a bind. The only viable option for this type of model is to use an open-source model and run it yourself.
Yet large language and vision models are not the only models in wide use. Software teams building their own models are usually trying to solve classification, recommendation, or anomaly detection problems. For example, a health-tech company might train a model to detect cancer in medical images, such as MRI scans. Or an online store may want to look at previously purchased and viewed products to make intelligent recommendations for other products a customer should consider. But these use cases require sensitive information. The vendor developing the e-commerce software will need a different product recommendation model for each customer, which means handling sensitive sales information, and potentially even more sensitive end-user behavior information, both to train models and to produce inferences. In the healthcare case, a software company will need to obtain imaging data with known results from hospitals or doctors’ offices, which may be problematic in light of health privacy laws.
Using encryption can solve these problems and more.
If you only need the model encrypted and it’s a relatively small one, then Fully Homomorphic Encryption (FHE) could be a good choice. Several companies offer this, though FHE models are often much slower at producing inferences than unencrypted models are. And this approach typically requires the training data to be unencrypted during training.
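FHE toolkits differ widely in their APIs, so rather than assume any particular one, here is a minimal sketch of the core idea of computing on ciphertexts, using the open-source python-paillier (`phe`) library. Paillier is only partially homomorphic (additions plus multiplication by plaintext constants), but that is enough to score a linear model over encrypted features; the weights and feature values below are made up for illustration, and a real FHE system generalizes this idea to much richer models.

```python
from phe import paillier  # pip install phe (python-paillier)

# The customer generates the keypair and keeps the private key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# A made-up linear model that the vendor holds in plaintext: score = w*x + b
weights = [0.8, -1.2, 0.5]
bias = 0.3

# The customer encrypts their feature vector before sending it out.
features = [2.0, 1.5, 3.0]
encrypted_features = [public_key.encrypt(x) for x in features]

# The vendor computes the score directly on ciphertexts. Paillier supports
# ciphertext + ciphertext and ciphertext * plaintext constant, which is all
# a linear model needs; the vendor never sees the inputs or the score.
encrypted_score = public_key.encrypt(bias)
for w, enc_x in zip(weights, encrypted_features):
    encrypted_score = encrypted_score + enc_x * w

# Only the customer, holding the private key, can read the result.
print(private_key.decrypt(encrypted_score))  # ~1.6, i.e. w*x + b
```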
The other approach is a partially homomorphic one that uses approximate-distance-comparison-preserving encryption, which is what underlies IronCore’s Cloaked AI. This approach is pretty neat, in part, because it doesn’t have any noticeable impact on performance (either model accuracy or inference speed). It does require that the model be trained on vector embeddings, but this covers many use cases.
The training data, such as categorized MRI scans or product purchase and browsing histories, is fed into an embedding model, and the resulting vectors are encrypted with a specific key. Optionally, the encryption can be made one-way so the vectors can’t be decrypted.
That training data can then be shared with a vendor or another third party who can use it to build models. These models contain encrypted data and can only be queried with the proper key. Without the proper key, they output garbage. Data scientists building the models can’t read or invert the training data in a meaningful way, but with encrypted test data, they can validate the accuracy of the models they build.
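To make that workflow concrete, here is a toy numpy sketch, with everything invented for illustration: the “key” is a random orthogonal rotation, a stand-in that preserves distances exactly but is not IronCore’s scheme and is not a secure construction on its own, and the embeddings are synthetic stand-ins for vectors produced by a real embedding model. What it demonstrates is the point above: nearest-neighbor structure survives for queries encrypted with the right key and is lost for anything else.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def keygen(dim: int) -> np.ndarray:
    """Toy 'key': a random orthogonal matrix (a distance-preserving rotation).

    A stand-in only: NOT IronCore's construction and NOT secure by itself;
    it just illustrates that distance comparisons survive the transform.
    """
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def encrypt(vectors: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Apply the key to one vector or a batch of vectors."""
    return vectors @ key

# Synthetic stand-ins for embeddings of sensitive records (e.g. labeled MRIs).
dim = 64
train_vecs = rng.normal(size=(200, dim))

key = keygen(dim)                     # held by the customer
enc_train = encrypt(train_vecs, key)  # all the vendor ever receives

def nearest(enc_query: np.ndarray, enc_index: np.ndarray) -> int:
    """The vendor's 'model': nearest neighbor over encrypted vectors only."""
    return int(np.argmin(np.linalg.norm(enc_index - enc_query, axis=1)))

# A query that is a slight perturbation of record 0.
query = train_vecs[0] + 0.01 * rng.normal(size=dim)

print(nearest(encrypt(query, key), enc_train))          # 0: the true neighbor
print(nearest(encrypt(query, keygen(dim)), enc_train))  # wrong key: arbitrary record
```

A production scheme adds the properties this toy lacks, such as actual security, the optional one-way operation, and approximate rather than exact distance preservation, but the query-only-with-the-right-key behavior is the same.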
When models are built this way, using encrypted vector embeddings as the training data, they can be shipped to more places, such as the edge. Vendors can build models for customers without the customer risking data loss. Hospitals can share information without violating health privacy laws. Data sovereignty and residency laws can be respected even through the model training and experimentation phases. Privacy can be built into almost any use case where the training data is sensitive, the input can be represented as an embedding, and the model is built from scratch.
Updated Clauses
Companies can allow, or even encourage, the use of privacy-preserving, security-forward AI by tweaking their legal clauses a bit. For example, the confidentiality provision might become:
Vendor shall not input, upload, transmit, or otherwise disclose any Confidential Information to any artificial intelligence system, large language model, machine learning tool, or similar automated system (“AI System”) unless such data is fully protected using an approved data cloaking or encryption technology, such as IronCore Labs’ Cloaked AI, or another method expressly authorized in writing by [Company Name].
And for that clause that governed the use of data for training AI models, it could be modified to read instead:
2.1. Vendor shall ensure that any approved AI System used in connection with the Services: (a) does not train on, retain, or reuse [Company Name]’s Confidential Information, except where such training occurs exclusively on cloaked, encrypted, or otherwise cryptographically protected data rendered incapable of revealing the underlying Confidential Information, through technology such as IronCore’s Cloaked AI or equivalent; (b) processes only cloaked, encrypted, or pseudonymized data, incapable of revealing underlying confidential content, whether during training, inference, or output generation; …
Additionally, the security section of any agreements should be expanded to cover the use of confidential information in AI systems:
Vendor shall implement and maintain appropriate technical and organizational measures to protect Confidential Information from unauthorized access, use, or disclosure, including when processed by AI Systems.
…
- Vendor shall maintain and demonstrate a data protection architecture that prevents Confidential Information from being exposed in plain text to any AI System or third party.
- Vendor shall use only approved encryption or data cloaking technologies, such as IronCore Labs’ Cloaked AI, to safeguard Confidential Information before any AI or vector database interaction.
- All encryption keys or decryption capabilities must remain under the exclusive control of Vendor and/or [Company Name], and not any third-party AI provider.
The Result
Privacy-preserving technology in AI workflows can be a huge boon for companies looking to build AI models or leverage customer or user data in AI systems without introducing security or privacy risks to that data. Any company that builds AI models or implements RAG workflows for customers who are wary of having their data used in AI systems can build with privacy in mind, push back on blanket legal clauses that forbid AI training, and suggest alternatives that get to the heart of the issue: protecting confidential information.