Vector database security: Qdrant
A look at the Qdrant vector database from a privacy, security, and risk management perspective
Qdrant's product
About Qdrant
Qdrant is an open source vector database that can be run either via their cloud service or as a Docker container in your own infrastructure. They have a good reputation and are one of the more interesting open source projects to follow in this space.
The Qdrant database is written in Rust, which is a wise choice for both memory safety (eliminating a major class of attacks) and performance. Side note: we write almost everything at IronCore in Rust, so we may be biased about how much this choice matters, but the language also helps catch bugs before they ship.
The database supports filtering on associated data and can work with both sparse and dense vectors using built-in hybrid search options. They have public benchmarks showing that they’re much faster than the handful of competitors they chose to compare against. Pinecone is notably absent, though that makes sense: you can’t run Pinecone on your own hardware, so a reasonable comparison is hard to make.
Qdrant's product page on 2024-03-21
Qdrant (in)security
An assessment of Qdrant's security
All vector embeddings and vector databases have inherent issues because of the data they hold (see next section for more). Regardless, we look at the basics of security to see what Qdrant gets right and where they miss the mark.
SaaS option: weak security
We base our evaluations purely on public data. Unfortunately, Qdrant has published almost no security information. They use Drata to power their “trust center” and to indicate which policies they have, but they don’t share any details of those policies unless you give them your personal information and sign an NDA, which we won’t do. The security PDF linked from their cloud service had the most information, and it’s incredibly short and basic.
Because Qdrant can run either as a cloud service or self-hosted, the analysis is trickier. For those who self-host, Qdrant has a Security Guide, which explains how to properly set up the database. We’ll get back to that in a minute.
First, we’ll look at their hosted service. We used our Creative Commons-licensed SaaS Security Maturity Evaluation checklist to evaluate the overall security of Qdrant’s SaaS offering. The result was the LOWEST “WEAK” RATING we’ve yet encountered: just four points.
We encourage you to look at the filled-out checklist on Google Sheets to see the details, but in short, they have only the most basic security precautions in place.
Misleading PCI claims
One misleading portion of their security PDF says, “The servers and services hosted on them are certified as complying with the PCI Data Security Standard … The certification confirms that the services adhere to the PCI DSS Level 4 requirements for security management, policies, procedures, network architecture, software design, and other critical protective measures.”
- This statement seems to say that the Qdrant services are PCI certified. Or, possibly, that because AWS/GCP/Azure have PCI certifications, then any services (including Qdrant) running on them are automatically PCI compliant.
- This is false. Qdrant is not PCI certified and no part of certification is automatic. You can check and see if a company has a PCI certification on the public PCI website. Qdrant isn’t listed.
Now, not having PCI isn’t itself a bad thing – as long as they aren’t processing payments – but making misleading statements about certifications is problematic. Trust is hard to earn, but easy to lose, and nothing loses trust faster than bogus claims.
Lack of certifications
Qdrant appears to lack any third-party certifications, audits, or pen tests – certainly they don’t publicly claim to have these – which means there’s no way to be sure that any cybersecurity policy they profess is actually followed.
Self-hosted option: insecure defaults
But what about hosting Qdrant yourself? For that analysis, let’s look at their security guide. To their credit, they make a big deal about the configuration being insecure by default to motivate people to fix it. There’s a lack of TLS by default and a lack of any sort of authentication by default. Additionally, there’s no authentication for intra-service communications, meaning that any firewall misconfiguration exposing the wrong ports would put the service at risk even if TLS and auth are set up properly.
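To make the three gaps above concrete, here is a minimal sketch of a pre-deployment check. Everything in it is hypothetical: the `DeploymentSettings` struct, its field names, and the `audit` function are illustrations of the checklist, not Qdrant's configuration format or API. The real setting names should be taken from Qdrant's Security Guide.

```rust
// Hypothetical pre-deployment audit. The struct and field names are
// illustrative assumptions, not Qdrant's actual configuration keys.

#[derive(Debug, Clone, Default)]
struct DeploymentSettings {
    /// API key that clients must present, if one is configured.
    api_key: Option<String>,
    /// Whether client-facing connections use TLS.
    tls_enabled: bool,
    /// Whether the internal (node-to-node) port is reachable from outside
    /// the private network, for example through a firewall misconfiguration.
    internal_port_exposed: bool,
}

/// Returns one finding per gap described in the text above.
fn audit(settings: &DeploymentSettings) -> Vec<&'static str> {
    let mut findings = Vec::new();

    match &settings.api_key {
        None => findings.push("no authentication configured"),
        Some(key) if key.trim().is_empty() => findings.push("API key is empty"),
        Some(_) => {}
    }
    if !settings.tls_enabled {
        findings.push("client connections are not encrypted (no TLS)");
    }
    // This is a finding even when auth and TLS are set up, because
    // intra-service traffic is not authenticated.
    if settings.internal_port_exposed {
        findings.push("internal port is exposed; intra-service traffic is unauthenticated");
    }
    findings
}

fn main() {
    // `Default` mirrors the out-of-the-box state: no key, no TLS.
    let defaults = DeploymentSettings::default();
    for finding in audit(&defaults) {
        println!("default config: {finding}");
    }

    let hardened = DeploymentSettings {
        api_key: Some("a-long-random-value".to_string()),
        tls_enabled: true,
        internal_port_exposed: false,
    };
    println!("hardened config findings: {}", audit(&hardened).len());
}
```

A check like this could run in CI or at service startup so that a deployment with default settings fails loudly rather than shipping.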
Despite the docs, it’s a very bad idea to ship a database service with a configuration that’s insecure by default. Recent history shows why this is the case:
Insecure defaults cautionary tale #1: MongoDB
Mongo, for example, started off with insecure defaults as a way to make it easy for developers to get started. After a period of constant headlines showing Mongo databases being pwned, they changed their stance.
Insecure defaults cautionary tale #2: Elasticsearch
Like MongoDB, Elasticsearch started life with no authentication by default and with documentation that warned users to set up security before deploying. It went about as well as it did for Mongo, with mass exploitations and lots of headlines about databases being raided, ransomed, or ruined.
But in case you think these problems are restricted to unsophisticated groups lacking proper security postures, take a look at what happened to Microsoft:
A large Elasticsearch cluster holding all of Microsoft’s customer support conversations was accidentally made public. It was running with the default of no authentication, so anyone who stumbled on the URL was able to browse it all. Hacker skill level required: zero.
Conclusion
While Qdrant is off to a great start with core features, performance, and community engagement, they have a long way to go on the security front. The lack of any certifications, the poor maturity score, and the disappointing choices noted above are reasons to be wary of entrusting unencrypted data to Qdrant, whether in the cloud or self-hosted.
Quick Summary
Cloud
- Open source
- Almost no security information published
- The LOWEST "WEAK" RATING for security maturity we've yet encountered
- Not PCI certified, despite claims to the contrary
- Lack of any third-party certifications, audits, or pen tests
On-prem
- Open source
- Good documentation on how to fix default insecurities before production
- Defaults to unencrypted connections (no TLS)
- Defaults to no authentication needed
- No authentication for intra-service communications, even if otherwise configured correctly, meaning an exposed port could lead to full compromise of data
Note 1: Assessment as of 2024-03-21 based on public data.
Note 2: Qdrant has done some great work and this is not meant to take away from that. It is, however, intended to shed light on issues so that customers are aware of the risks when evaluating a purchase. This is not a buy or don’t buy recommendation.
Vector security risks
Vector embeddings are dangerous
Independent of the specific things Qdrant is doing (or not doing) for security, they are storing vectors, which are dangerous if not treated properly.
Vector embeddings are long arrays of numbers that represent an AI embedding model’s understanding of its input. They can represent photos, faces, fingerprints, and text. They’re used for natural-language search, for facial recognition, and much more. And they’re stored in vector databases.
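As a rough illustration of how a vector database uses these arrays for search, the sketch below compares a query vector against stored vectors using cosine similarity, one common similarity measure. The three-dimensional vectors are made up for readability (real embeddings typically have hundreds or thousands of dimensions), and the function names are our own rather than any database's API.

```rust
/// Cosine similarity between two vectors. Returns `None` when the lengths
/// differ, the vectors are empty, or either vector has zero length (norm).
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the stored vector most similar to the query, if any can be scored.
/// This is a brute-force scan; real databases use approximate indexes.
fn nearest(query: &[f32], stored: &[Vec<f32>]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, candidate) in stored.iter().enumerate() {
        if let Some(score) = cosine_similarity(query, candidate) {
            if best.map_or(true, |(_, best_score)| score > best_score) {
                best = Some((i, score));
            }
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    // Made-up embeddings standing in for three stored documents.
    let stored = vec![
        vec![0.0, 1.0, 0.0],
        vec![0.9, 0.1, 0.0],
        vec![0.0, 0.2, 0.9],
    ];
    let query = [1.0, 0.0, 0.0];
    match nearest(&query, &stored) {
        Some(i) => println!("closest stored vector: index {i}"),
        None => println!("no comparable vectors"),
    }
}
```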
Vector databases have soared in popularity. As GenAI usage has skyrocketed, vector DB usage has too, since they’re used to power a number of AI workflows, such as the popular Retrieval-Augmented Generation (RAG) (see our explanation of RAG and its security risks) used to inject private data into AI chat bots.
While it’s commonly thought that these vector embeddings are only useful for search and in AI systems, and that they don’t pose a security risk, this is a busted myth. In fact, vector embeddings can be reversed with high accuracy, as we’ve demonstrated while running open-source attacks from academia against text embeddings and face embeddings.
AI is a powerful magnet for sensitive data. The more data these AI systems have, the more useful they are. And the more risk they pose.
Vector databases can’t and won’t ever enforce the same permissions on the data as the native data stores. For example, CRM data will have custom domain-specific logic inside the CRM, but not in the vector database. This makes the vectors a gold mine of data for anyone who can gain access. They contain copies of the most sensitive data in healthcare, finance, personal activities and productivity, and business. A recent academic paper (with open source code) showed how the use of AI embeddings in email systems has enabled the first ever AI worm.
For most organizations, these vector databases (even when they’re added to existing databases) will present the biggest risk to the sensitive data in their infrastructure.
Fixing Qdrant
Using application-layer encryption to mitigate the risks from Qdrant
Today, if you’re looking for ways to comply with privacy and security laws for the data in vector databases, you will find two types of solutions on the market:
1. Redaction/tokenization of PII: a poor approach
The redaction approach is troublesome because it reduces the value of the AI systems. You can’t ask to summarize the report written by so-and-so last week if names are removed. Ditto for dates and such. It’s also troublesome because names and birthdates aren’t the only sensitive data in these systems. They’re chock full of sensitive HR data, financial data, email content, document content, photos, biometrics, and so much more. Attempts to simply redact or tokenize are half-measures, at best.
2. Property-preserving encryption: security and data privacy for everything
The best way to mitigate the risks of vector embeddings is to encrypt them at the application layer. It protects the data from any direct-to-database peeking – whether by hackers or insiders at your company or your service provider. It fixes issues with data sovereignty by allowing keys to be sovereign and making subpoenas to service providers useless. And it prevents wholesale data leaks. Perhaps most importantly, if someone does gain unauthorized access to your vector database, that isn’t a breach that must be disclosed, nor an international transfer of data unless they also compromise the key, which is likely elsewhere and better protected.
If you simply applied classic encryption to the vectors, though, you couldn’t search them. You’d break natural-language queries, searches over images, facial recognition matching, clustering, anomaly detection, and basically everything a vector database is good for. Fortunately, encryption doesn’t have to mean giving up search.
Property-preserving vector encryption lets developers query a vector database if and only if they have the right key. Without the key, they can’t meaningfully query it. But with the key, everything works as if there were no encryption at all. And better yet, it works with every single vector database, not just Qdrant.
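To give some intuition for how a transformed vector can remain searchable, here is a toy sketch of a keyed, distance-preserving transform: the key determines a permutation of the dimensions plus a sign flip per dimension, which leaves dot products between transformed vectors unchanged. This is not the algorithm Cloaked AI uses, and it is not secure (it leaks the magnitude of every value, among other things). The `ToyKey` type, the `SplitMix64` generator, and all other names are assumptions made purely for illustration.

```rust
/// Small deterministic generator (SplitMix64), used only to derive the toy
/// key material. It is NOT cryptographically secure.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical toy key: a permutation of dimensions and a sign per dimension.
struct ToyKey {
    perm: Vec<usize>,
    signs: Vec<f32>,
}

impl ToyKey {
    fn derive(secret: u64, dims: usize) -> Self {
        let mut rng = SplitMix64(secret);
        let mut perm: Vec<usize> = (0..dims).collect();
        // Fisher-Yates shuffle driven by the secret.
        for i in (1..dims).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            perm.swap(i, j);
        }
        let signs = (0..dims)
            .map(|_| if (rng.next_u64() & 1) == 0 { 1.0 } else { -1.0 })
            .collect();
        ToyKey { perm, signs }
    }

    /// Reorders the dimensions and flips signs according to the key.
    fn transform(&self, v: &[f32]) -> Vec<f32> {
        assert_eq!(v.len(), self.perm.len(), "vector has the wrong dimension");
        self.perm
            .iter()
            .zip(&self.signs)
            .map(|(&src, &sign)| v[src] * sign)
            .collect()
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let dims = 4;
    let key = ToyKey::derive(0xC0FFEE, dims);
    let wrong_key = ToyKey::derive(0xBAD5EED, dims);

    let stored = [0.5, -1.0, 2.0, 0.25];
    let query = [1.0, 0.0, 3.0, -2.0];

    // Only the transformed vector would be sent to the database.
    let stored_enc = key.transform(&stored);

    println!("plaintext score: {}", dot(&query, &stored));
    println!("right-key score: {}", dot(&key.transform(&query), &stored_enc));
    println!("wrong-key score: {}", dot(&wrong_key.transform(&query), &stored_enc));
}
```

The point of the sketch is only the shape of the idea: a query transformed with the same key is scored against stored vectors exactly as the plaintext would be, while a query transformed with a different key is generally scored against scrambled dimensions. Production schemes are designed to meet much stronger security goals than this toy.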
Next steps:
Doing nothing is risky. Protecting your AI data is easy. And inexpensive to boot.
Copy. Paste. Done.
Integration examples show how easy it is
The Cloaked AI docs have integration examples for a number of popular vector databases. We use their examples and then modify them to show how easy it is to add encryption, usually in just a few lines of code.
Take a look at our Qdrant examples on GitHub
let point_futures = data.into_iter().map(|(sourcetext, embedding)| {
    encrypt_to_point(ironcore.clone(), &metadata, sourcetext, embedding)
});
Qdrant encryption code snippet
Other vector database security assessments:
- Pinecone
- Qdrant - this page