Vector database security: Pinecone

A look at the Pinecone vector database from a privacy, security, and risk management perspective

Pinecone's product

About Pinecone

Pinecone provides a dedicated cloud vector database. The company is notable for having raised some of the largest venture capital investments, from investors like Andreessen Horowitz, and it claims to be the most popular vector database. In particular, Pinecone claims “fast, cost-efficient performance at any scale” and to be best suited to very large datasets that need fast and accurate queries. (We’ll leave validation of these claims to others.)

Pinecone supports sparse as well as dense vectors, whereas most other solutions focus only on dense vectors. This lets things like hybrid keyword search work purely with vectors and, in some cases, return better results than pure dense-vector searches.
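To make the sparse-plus-dense idea concrete, here is a minimal sketch of how a hybrid score could be computed. It is an illustration only, not Pinecone's implementation: the function names and the `alpha` weighting are assumptions made for this example.

```python
from typing import Dict, List


def dense_score(a: List[float], b: List[float]) -> float:
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def sparse_score(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[index] for index, value in a.items() if index in b)


def hybrid_score(
    q_dense: List[float],
    q_sparse: Dict[int, float],
    d_dense: List[float],
    d_sparse: Dict[int, float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical weighting: alpha for the dense part, 1 - alpha for the sparse part."""
    dense = dense_score(q_dense, d_dense)
    sparse = sparse_score(q_sparse, d_sparse)
    return alpha * dense + (1 - alpha) * sparse


# Toy query and document: the dense part captures meaning,
# the sparse part captures an exact keyword match at index 3.
score = hybrid_score([1.0, 0.0], {3: 2.0}, [0.5, 0.5], {3: 1.0, 7: 4.0})
print(score)
```

In this sketch, setting `alpha` to 1.0 gives a pure dense search and setting it to 0.0 gives a pure keyword-style search.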

Pinecone is closed-source and SaaS-only, unlike some of its more prominent competitors, which are based on open source or offer on-prem options for private deployments.

Screenshot of Pinecone's product page header on 2024-03-13

Pinecone (in)security

An assessment of Pinecone's security

In March 2024, we gave Pinecone a security maturity score of “weak” based on our standardized maturity scoring (the spreadsheet is still available). We also pointed out that Pinecone made false claims about having end-to-end encryption, and we found its RBAC to be practically non-existent.

We’ve since determined that the end-to-end encryption claims are no longer public (as far as we can find) and that Pinecone has enhanced its RBAC. It has also added a Customer Managed Encryption Key (CMEK) feature.

Because these companies are moving so fast, we’re removing our outdated security evaluation and recommend that potential customers do their own. All vector embeddings and vector databases have inherent issues because of the data they hold (see next section for more).

Vector security risks

Vector embeddings are dangerous

Independent of what Pinecone is or isn’t doing for security, it stores the vectors, which are dangerous if not handled properly.

Vector embeddings are long arrays of numbers that represent an AI embedding model’s understanding of its input. They can represent photos, faces, fingerprints, and text. They’re used for natural-language search, for facial recognition, and much more. And they’re stored in vector databases.
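As a rough sketch of what "search" means over these arrays, the example below ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and their labels are made up for illustration; real embeddings typically have hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than this brute-force loop.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: List[float], stored: Dict[str, List[float]]) -> str:
    """Return the id of the stored vector most similar to the query."""
    return max(stored, key=lambda key: cosine_similarity(query, stored[key]))


# Hypothetical embeddings keyed by the text they came from.
stored = {
    "cat": [0.9, 0.1, 0.0],
    "car": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]
print(nearest(query, stored))
```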

Vector databases have soared in popularity. As GenAI usage has skyrocketed, vector DB usage has too, since they’re used to power a number of AI workflows, such as the popular Retrieval-Augmented Generation (RAG) (see our explanation of RAG and its security risks) used to inject private data into AI chat bots.

While it’s commonly thought that these vector embeddings are only useful for search and in AI systems, and that they don’t pose a security risk, this is a busted myth. In fact, vector embeddings can be reversed with high accuracy, as we’ve demonstrated while running open-source attacks from academia against text embeddings and face embeddings.

AI is a powerful magnet for sensitive data. The more data these AI systems have, the more useful they are. And the more risk they pose.

Vector databases can’t and won’t ever enforce the same permissions on the data as the native data stores. For example, CRM data will have custom domain-specific logic inside the CRM, but not in the vector database. This makes the vectors a gold mine of data for anyone who can gain access. They contain copies of the most sensitive data in healthcare, finance, personal activities and productivity, and business. A recent academic paper (with open source code) showed how the use of AI embeddings in email systems has enabled the first ever AI worm.

For most organizations, these vector databases (even when they’re added to existing databases) will present the biggest risk to the sensitive data in their infrastructure.

Fixing Pinecone

Using application-layer encryption to mitigate the risks from Pinecone

Today, if you’re looking for ways to comply with privacy and security laws for the data in vector databases, you will find two types of solutions on the market:

1. Redaction/tokenization of PII: a poor approach

The redaction approach is troublesome because it reduces the value of the AI systems. You can’t ask the system to summarize the report written by so-and-so last week if names are removed. Ditto for dates and such. It’s also troublesome because names and birthdates aren’t the only sensitive data in these systems. They’re chock-full of sensitive HR data, financial data, email content, document content, photos, biometrics, and so much more. Attempts to simply redact or tokenize are half-measures, at best.

2. Property-preserving encryption: security and data privacy for everything

The best way to mitigate the risks of vector embeddings is to encrypt them at the application layer. It protects the data from any direct-to-database peeking, whether by hackers or by insiders at your company or your service provider. It fixes issues with data sovereignty by allowing keys to be sovereign and making subpoenas to service providers useless. And it prevents wholesale data leaks. Perhaps most importantly, if someone does gain unauthorized access to your vector database, that access isn’t a breach that must be disclosed, nor an international transfer of data, unless the attacker also compromises the key, which is likely stored elsewhere and better protected.

The usual objection is that encrypted vectors can’t be searched. With classic encryption, that’s true: you’d break natural-language queries, searches over images, facial recognition matching, clustering, anomaly detection, and basically everything a vector database is good for. But encryption doesn’t have to break search.

Property-preserving vector encryption lets developers query a vector database if and only if they have the right key. Without the key, they can’t meaningfully query it. But with the key, everything works as if there were no encryption at all. And better yet, it works with every single vector database, not just Pinecone.
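To build intuition for why a keyed transform can leave search working, here is a toy sketch. It is not the Cloaked AI algorithm and it is not secure: it applies a key-derived signed permutation (an orthogonal transform), which preserves dot products between vectors transformed with the same key. All names here are hypothetical.

```python
import random
from typing import List


def toy_keyed_transform(vector: List[float], key: str) -> List[float]:
    """Toy only, NOT secure: shuffle positions and flip signs based on the key."""
    rng = random.Random(key)  # the same key always yields the same shuffle and signs
    order = list(range(len(vector)))
    rng.shuffle(order)
    signs = [rng.choice((-1.0, 1.0)) for _ in vector]
    return [sign * vector[i] for sign, i in zip(signs, order)]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


doc = [0.2, -0.5, 0.1, 0.7, 0.3, -0.4]
query = [0.1, -0.6, 0.0, 0.8, 0.2, -0.3]

stored = toy_keyed_transform(doc, "tenant-key")        # what the database would hold
good_query = toy_keyed_transform(query, "tenant-key")  # query made with the right key
bad_query = toy_keyed_transform(query, "wrong-key")    # query made with another key

print(dot(doc, query))          # similarity on the plaintext vectors
print(dot(stored, good_query))  # same similarity, up to float rounding
print(dot(stored, bad_query))   # generally an unrelated score
```

A real scheme has to do far more than this, since a signed permutation leaks the magnitudes of every value; the sketch only shows that a keyed transform and similarity search are not mutually exclusive.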

Next steps:

Doing nothing is risky. Protecting your AI data is easy. And inexpensive to boot.

Copy. Paste. Done.

Integration examples show how easy it is

The Cloaked AI docs have integration examples for a number of popular vector databases. We take each vector database’s own examples and modify them to show how easy it is to add encryption, usually in just a few lines of code.

Take a look at our Pinecone examples on GitHub

Python

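# ...
for row in dataset.documents.itertuples():
    # Wrap the row's embedding values for encryption
    plaintext = alloy.PlaintextVector(row.values, "quora", "sentence")
    # Encrypt per tenant; `await` means this loop must run inside an async function
    encrypted = await sdk.vector().encrypt(plaintext, tenant_id)
# ...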
