Vector database security: Pinecone

A look at the Pinecone vector database from a privacy, security, and risk management perspective

Pinecone's product

About Pinecone

Pinecone provides a dedicated cloud vector database. The company is notable for having some of the largest venture capital investments from investors like Andreessen Horowitz, and they claim to be the most popular vector database. They particularly claim “fast, cost-efficient performance at any scale” and to be best suited to very large datasets with fast and accurate queries. (We’ll leave validation of these claims to others.)

Pinecone supports sparse as well as dense vectors, whereas most other solutions only focus on dense vectors. This allows for things like hybrid keyword search to work purely with vectors and to provide better search results in some cases compared to pure dense vector embedding searches.
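
To make the sparse-plus-dense idea concrete, here is a minimal sketch of hybrid scoring. It assumes a simple weighted sum of a dense dot product and a sparse dot product. The function names and the `alpha` weight are illustrative assumptions, not Pinecone's API or internals.

```python
from typing import Dict, List


def dense_dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length dense vectors."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {dimension index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


def hybrid_score(
    q_dense: List[float],
    q_sparse: Dict[int, float],
    d_dense: List[float],
    d_sparse: Dict[int, float],
    alpha: float = 0.5,
) -> float:
    """Blend semantic (dense) and keyword-like (sparse) similarity.

    alpha=1.0 is pure dense search; alpha=0.0 is pure sparse search.
    The weighting scheme is an assumption made for illustration.
    """
    dense_part = dense_dot(q_dense, d_dense)
    sparse_part = sparse_dot(q_sparse, d_sparse)
    return alpha * dense_part + (1.0 - alpha) * sparse_part


# Made-up data: doc_b is slightly less similar semantically,
# but it contains the exact keyword (sparse dimension 7) from the query.
q_dense, q_sparse = [1.0, 0.0], {7: 1.0}
doc_a = ([0.8, 0.6], {})
doc_b = ([0.6, 0.8], {7: 1.0})

print("dense only:", hybrid_score(q_dense, q_sparse, *doc_a, alpha=1.0),
      hybrid_score(q_dense, q_sparse, *doc_b, alpha=1.0))
print("hybrid:", hybrid_score(q_dense, q_sparse, *doc_a),
      hybrid_score(q_dense, q_sparse, *doc_b))
```

In this toy data, pure dense scoring prefers `doc_a`, while the hybrid score prefers `doc_b` because of the exact keyword match.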

Pinecone is closed-source and SaaS-only, unlike some of their more prominent competitors who are based on open source or have on-prem options for private deployments.

Pinecone's product page on 2024-03-13

Pinecone (in)security

An assessment of Pinecone's security

All vector embeddings and vector databases have inherent issues (see next section for more), but ignoring vector embedding issues for now, here’s what Pinecone gets right and where they miss the mark.

Weak security maturity

First, we used our Creative Commons-licensed SaaS Security Maturity Evaluation checklist to evaluate the overall security of Pinecone’s SaaS offering. The result was a dismal “WEAK” rating. We encourage you to look at the filled-out checklist on Google Sheets to see the details.

Screenshot of weak security maturity score

In short, they have some of the basics, but are also missing important bits. They’re only a few points away from being in the low end of the “reasonable” classification, but a company that potentially holds very sensitive data should do better than “reasonable,” which is more appropriate for companies that handle low sensitivity data. And “Weak” is a terrible showing for any SaaS company.

On the positive side, Pinecone has SOC2 Type II, which means they at least have some organizational and technical structures in place, do at least basic pen testing, and have an auditor verify that they meet their own security policies. It’s good to see that they offer SSO to some customers as well.

Finally, they promise that they “never use your data other than servicing API calls,” which is good. However, this likely applies only to your database data, not to the information they hold about you. Their privacy page says they reserve the right to share your personal info with third-party advertisers, other software companies like Facebook, and more.

False claims

Unfortunately, there’s more bad news. From their security guide: they claim to provide End-to-end Encryption (E2EE) (“Pinecone provides end-to-end encryption for user data, including encryption in transit and at rest.”). But they actually don’t.

Screenshot of Pinecone's bogus E2EE claim

  • 🤯 It’s astounding they’d mislead their customers with this boldly false claim. They are characterizing the table-stakes minimal safeguards of HTTPS and disk-level encryption as E2EE.
  • This is exactly what got Zoom into trouble with the FTC.

Screenshot of Ars Technica headline: Zoom lied to users about end-to-end encryption for years, FTC says

The encryption they offer does not bring meaningful data protection to the customer data they hold on running systems. Any of their ops or engineering staff with access can read your data (and so can anyone who hacks into their infrastructure). They do not have application-layer encryption (ALE), which would meaningfully protect data on a running server, and they certainly haven’t designed a system that would prevent them from ever reading the vectors or metadata they hold, which is the test for E2EE.

RBAC dreams

Finally, Pinecone claims they offer role-based access control (RBAC) as a security feature, which they highlight. And this is true, strictly speaking, but it’s also misleading since their RBAC is incredibly weak. Their API keys have zero permissions or limitations, which means any API key can be used to not just query, but to insert and delete data and more besides. And for users, there are only two simplistic roles: “owner” and “user”. Both roles can create and delete indices and data. The owner simply has billing controls while the user doesn’t. And these roles, such as they are, are per-project, which is a very coarse-grained permission. There are no per-index or per-data segment permissions of any kind.
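
For contrast, here is a minimal sketch of what finer-grained API key scoping could look like: each key carries per-index permissions, and every operation is checked against them. All names here (`Permission`, `ScopedApiKey`, `allows`) are hypothetical and exist only for illustration. Nothing below is Pinecone's API.

```python
from dataclasses import dataclass, field
from enum import Flag
from typing import Dict


class Permission(Flag):
    """Hypothetical per-index permissions."""
    NONE = 0
    QUERY = 1
    UPSERT = 2
    DELETE = 4


@dataclass(frozen=True)
class ScopedApiKey:
    """Hypothetical API key whose grants are listed per index."""
    key_id: str
    grants: Dict[str, Permission] = field(default_factory=dict)

    def allows(self, index: str, needed: Permission) -> bool:
        """Return True only if every needed permission is granted on the index."""
        granted = self.grants.get(index, Permission.NONE)
        return (granted & needed) == needed


# A key for a search front end: it may query one index and nothing else.
search_key = ScopedApiKey("search-frontend", {"products": Permission.QUERY})

print(search_key.allows("products", Permission.QUERY))   # query the granted index
print(search_key.allows("products", Permission.DELETE))  # delete from it
print(search_key.allows("hr-docs", Permission.QUERY))    # query a different index
```

With a design along these lines, a leaked front-end key could read from one index but could not delete data or touch other indexes, which is the kind of limit the current all-powerful keys lack.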

Screenshot of Pinecone's user roles

Conclusion

Security is hard. We get that. Making false claims, though, is extremely bad for credibility. But at least their privacy policy is honest when it says, “we cannot guarantee the security of personal information.” And, right, of course not, but also… buyer beware.

Quick Summary

  • Security maturity checklist score of "weak"
  • SOC2 Type II certified
  • SSO available to some customers
  • Promise to never use your data other than "servicing API calls"
  • But their privacy page says they reserve the right to share your personal info with third-party advertisers, software companies like Facebook, and more, which is at odds with that promise
  • False claim that they provide end-to-end encryption
  • No meaningful data protection
  • Offer role-based access control (RBAC)
  • But, the RBAC implementation is extremely weak

Note 1: Assessment as of 2024-03-14 based on public data.

Note 2: Pinecone has done some great work and this is not meant to take away from that. It is, however, intended to shed light on issues so that customers are aware of the risks when evaluating a purchase. This is not a buy or don’t buy recommendation.

Vector security risks

Vector embeddings are dangerous

And independent of the specific things Pinecone is doing (or not doing) for security, they are storing the vectors, which are dangerous if not treated properly.

Vector embeddings are long arrays of numbers that represent an AI embedding model’s understanding of its input. They can represent photos, faces, fingerprints, and text. They’re used for natural-language search, for facial recognition, and much more. And they’re stored in vector databases.
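
As a rough illustration of how such arrays are used for search, the sketch below ranks stored vectors by cosine similarity to a query vector. The three-dimensional "embeddings" are made up for the example. Real embedding models produce hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def nearest(
    query: Sequence[float], stored: Dict[str, List[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Brute-force search: score every stored vector and return the top k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for embeddings of three documents.
stored = {
    "invoice": [0.9, 0.1, 0.0],
    "resume": [0.1, 0.9, 0.2],
    "x-ray": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]  # stand-in for the embedding of a search phrase

print(nearest(query, stored, k=2))
```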

Vector databases have soared in popularity. As GenAI usage has skyrocketed, vector DB usage has too, since they’re used to power a number of AI workflows, such as the popular Retrieval-Augmented Generation (RAG) (see our explanation of RAG and its security risks) used to inject private data into AI chat bots.

While it’s commonly thought that these vector embeddings are only useful for search and in AI systems, and that they don’t pose a security risk, this is a busted myth. In fact, vector embeddings can be reversed with high accuracy, as we’ve demonstrated while running open-source attacks from academia against text embeddings and face embeddings.

AI is a powerful magnet for sensitive data. The more data these AI systems have, the more useful they are. And the more risk they pose.

Vector databases can’t and won’t ever enforce the same permissions on the data as the native data stores. For example, CRM data will have custom domain-specific logic inside the CRM, but not in the vector database. This makes the vectors a gold mine of data for anyone who can gain access. They contain copies of the most sensitive data in healthcare, finance, personal activities and productivity, and business. A recent academic paper (with open source code) showed how the use of AI embeddings in email systems has enabled the first-ever AI worm.

For most organizations, these vector databases (even when they’re added to existing databases) will present the biggest risk to the sensitive data in their infrastructure.

Fixing Pinecone

Using application-layer encryption to mitigate the risks from Pinecone

Today, if you’re looking for ways to comply with privacy and security laws for the data in vector databases, you will find two types of solutions on the market:

1. Redaction/tokenization of PII: a poor approach

The redaction approach is troublesome because it reduces the value of the AI systems. You can’t ask to summarize the report written by so-and-so last week if names are removed. Ditto for dates and such. It’s also troublesome because names and birthdates aren’t the only sensitive data in these systems. They’re chock full of sensitive HR data, financial data, email content, document content, photos, biometrics, and so much more. Attempts to simply redact or tokenize are half-measures, at best.

2. Property-preserving encryption: security and data privacy for everything

The best way to mitigate the risks of vector embeddings is to encrypt them at the application layer. It protects the data from any direct-to-database peeking – whether by hackers or insiders at your company or your service provider. It fixes issues with data sovereignty by allowing keys to be sovereign and making subpoenas to service providers useless. And it prevents wholesale data leaks. Perhaps most importantly, if someone does gain unauthorized access to your vector database, that isn’t a breach that must be disclosed, nor an international transfer of data unless they also compromise the key, which is likely elsewhere and better protected.

The obvious objection is that encrypted vectors can’t be searched. With classic encryption, that’s true: you’d break natural-language queries, searches over images, facial recognition matching, clustering, anomaly detection, and basically everything a vector database is good for. But it isn’t true of every kind of encryption.

Property-preserving vector encryption lets developers query a vector database if and only if they have the right key. Without the key, they can’t meaningfully query it. But with the key, everything works as if there were no encryption at all. And better yet, it works with every single vector database, not just Pinecone.
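
To build intuition for what "property-preserving" means, here is a deliberately simplified toy. It applies a key-derived signed permutation to each vector, which keeps dot products between vectors transformed under the same key unchanged. This is an assumption-laden illustration only: it is not secure (it leaks the magnitude of every component), and it is not the algorithm used by Cloaked AI or any other product. The function names are hypothetical.

```python
import hashlib
import random
from typing import List, Sequence


def toy_encrypt(vector: Sequence[float], key: bytes) -> List[float]:
    """Apply a key-derived signed permutation. TOY ONLY: not secure."""
    dim = len(vector)
    # Derive a deterministic seed from the key and the vector length.
    digest = hashlib.sha256(key + dim.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    # The key decides which source position lands in each output slot...
    order = list(range(dim))
    rng.shuffle(order)
    # ...and whether each output component has its sign flipped.
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return [sign * vector[src] for sign, src in zip(signs, order)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, standing in for the database's similarity function."""
    return sum(x * y for x, y in zip(a, b))


key = b"tenant-a-secret"
doc = [0.1, 0.7, -0.2, 0.4]
query = [0.0, 0.9, -0.1, 0.3]

plain_score = dot(query, doc)
same_key_score = dot(toy_encrypt(query, key), toy_encrypt(doc, key))
wrong_key_score = dot(toy_encrypt(query, b"attacker-guess"), toy_encrypt(doc, key))

print(plain_score, same_key_score, wrong_key_score)
```

A query transformed under the same key scores the same as it would on plaintext (up to floating-point rounding), so the database can rank results without ever seeing the original vectors. A query transformed under a different key will generally produce unrelated scores. Production schemes are built to resist attacks that this toy does not.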

Next steps:

Doing nothing is risky. Protecting your AI data is easy. And inexpensive to boot.

Copy. Paste. Done.

Integration examples show how easy it is

The Cloaked AI docs have integration examples for a number of popular vector databases. We use their examples and then modify them to show how easy it is to add encryption, usually in just a few lines of code.

Take a look at our Pinecone examples on GitHub

Python
# ...
for row in dataset.documents.itertuples():
    plaintext = alloy.PlaintextVector(row.values, "quora", "sentence")
    encrypted = await sdk.vector().encrypt(plaintext, tenant_id)
# ...
