2026-06-29

Patrick Walsh

Announcing VectorLens: See the PII Hiding in Your Vector Embeddings

A Flashlight for the Darkest Corner of Your AI Stack

Today we’re releasing VectorLens, a local CLI that privately scans the vectors in your database to find the PII hiding inside them. But to explain why it needs to exist, let me take you back. (Short on time? Skip down to “The VectorLens tool.”)

A few years ago, a prospect came to us with a deceptively simple question: did we have any way to protect semantic search?

It was a fair thing to ask us. We already had an encrypted search proxy that protected keyword search indexes and queries. But semantic search was different, with completely different mechanisms under the hood. We dug in.

It turns out, the most common way to do semantic search, which is search based on meaning rather than specific words, is to use something called vector embeddings. We’ve written about these elsewhere, but in short, an embedding is a mathematical representation of the meaning of a piece of text, image, or other input. Feed a sentence into an embedding model and you get back a fixed-length list of a few hundred to a few thousand numbers that hold relative meaning. Vectors with similar meanings are mathematically close together while vectors with different meanings are mathematically distant.
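To make "mathematically close" concrete, here is a minimal sketch using cosine similarity, one common closeness measure for embeddings. The three vectors are hand-made four-dimensional stand-ins that we invented for illustration; they did not come from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors; real embeddings have hundreds or thousands of dimensions.
invoice_overdue = [0.9, 0.1, 0.0, 0.2]
payment_late = [0.8, 0.2, 0.1, 0.3]
beach_vacation = [0.0, 0.1, 0.9, 0.1]

# Related meanings score near 1, unrelated meanings score near 0.
print(cosine_similarity(invoice_overdue, payment_late))
print(cosine_similarity(invoice_overdue, beach_vacation))
```

A semantic search is essentially this comparison run between a query vector and every stored vector, keeping the top scorers.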

So we started looking into ways to protect those vectors and we found a bunch of options. We did an analysis of which approach would be the most useful for the most people, and we built a product that encrypts vector embeddings. But before we went all-in, we had to answer a more basic question.

Is this a real problem?

Was protecting vectors even worth doing? Does it matter if they’re stolen?

The more we looked, the more obvious the answer became: as people lean harder on AI, they increasingly use semantic search to invisibly pull relevant context into their prompts. That’s what lets you ask an AI about your calendar, your documents, your finances, or last week’s meeting and get a useful answer. This retrieval step sits behind most of the AI features being built into software today. If you’ve heard the term RAG (Retrieval-Augmented Generation), that’s the architecture we’re talking about, and it runs on vector embeddings.

Then we started reading the research on attacks against embeddings, and that’s when it got interesting.

It turns out there are lots of ways to attack embeddings. The most important is the embedding inversion attack, which takes what looks like a meaningless string of numbers and turns it back into a close approximation, or even an exact reproduction, of the original input. And it works on embeddings of text, faces, and other images. Some techniques require training a dedicated attack model against a specific embedding model, while others are general purpose. It’s a surprisingly rich and active area of research.

What looks like a meaningless string of numbers can be turned back into a near-perfect replica of the original.

[Image: a wall of numbers resolving into something recognizable]

The problem: nobody knew it was a problem

Here’s the part that really got us. As we went around talking to people about this, we found that almost no one knew it was a problem.

We even talked to several CEOs and product managers from vector database companies, which at the time were going gangbusters (this was before every classic DB added vector capabilities). And these were the people whose entire job was to manage this type of data.

One CEO of a well-funded vector database company told us, point blank, that an embedding is “like a hash” and has no security implications if it’s stolen. He had never heard of inversion attacks, and his company was being trusted to store exactly this kind of data. That’s how little known these attacks were.

“It’s like a hash. It has no security implications if it’s stolen.” –Very wrong person
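The "like a hash" claim is easy to test in a toy setting. The sketch below is not one of the inversion attacks from the research. It is a much simpler guess-and-compare check, and `toy_embed` is a hypothetical stand-in we made up (hashed character trigrams) that captures spelling overlap rather than meaning. Still, it shows the property that matters: a near-miss guess tells an attacker nothing about a SHA-256 digest, but it scores well above unrelated text against a stolen vector.

```python
import hashlib
import math

DIM = 256  # toy dimensionality, chosen arbitrarily

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed trigram counts, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def differing_bit_fraction(text_a, text_b):
    """Fraction of the 256 SHA-256 output bits that differ between two inputs."""
    a = int.from_bytes(hashlib.sha256(text_a.encode("utf-8")).digest(), "big")
    b = int.from_bytes(hashlib.sha256(text_b.encode("utf-8")).digest(), "big")
    return bin(a ^ b).count("1") / 256

def best_guess(stolen_vector, candidates):
    """Pick the attacker's guess whose embedding is closest to the stolen vector."""
    return max(candidates, key=lambda c: cosine(stolen_vector, toy_embed(c)))

secret = "Jane Doe lives at 12 Oak Street"      # made-up example text
near_miss = "Jane Doe lives at 14 Oak Street"   # attacker's almost-right guess
guesses = [
    "Quarterly revenue grew eight percent",
    near_miss,
    "The meeting moved to Thursday",
]

# Hash: the near miss looks as unrelated as any random input (roughly half the bits differ).
print(differing_bit_fraction(secret, near_miss))
# Embedding: the near miss stands out from the unrelated guesses.
print(best_guess(toy_embed(secret), guesses))
```

With a real embedding model the attacker gets the same kind of "warmer, colder" signal on meaning, which is what the trained inversion models exploit far more efficiently.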

That was a wake-up call. We had the ability to build a solution, but it was for a problem that almost no one knew they had.

These days, the risks of embeddings are more widely recognized, and I’d like to think we had something to do with that. They’ve made it into the OWASP Top 10 for LLMs, and more privacy and security professionals know their vectors are a lurking problem. But “more” is doing a lot of work in that sentence. It’s still a dark area.

In most IT infrastructures, embeddings are a genuine shadow-data blind spot.

The PII-scanning industry doesn’t know what to do with embeddings

Part of the reason is that the Data Security Posture Management (DSPM) tools used for finding and reporting on personally identifiable information just don’t know how to deal with vectors.

DSPM companies have finally figured out that vectors are a problem, but the overwhelming response has been to label the data before it gets turned into a vector. In other words, they scan the text, label that, then copy the label forward to the vector.

That’s fine as far as it goes, but it has a gaping hole: what happens when you find vectors in a database and they aren’t labeled?

[Image: a baffled shrug, no idea what this is]

Maybe they’re unlabeled because there’s nothing sensitive. Or maybe it’s because the developer didn’t add labeling to the pipeline. You can’t tell just by looking.

And these days, developers are asking AI to build things. If the AI didn’t know to add the labeling, or forgot, the unlabeled pipeline could easily slip into production.

If those vectors aren’t labeled, and they aren’t encrypted, then you’ve quietly created copies of your most sensitive data that are unmonitored, unregulated, and unprotected. That’s a big problem, and it’s growing every single day.
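To make the label-forward approach and its gap concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the two regexes are far cruder than what a real DSPM scanner uses, and the record fields (`id`, `embedding`, `pii_labels`) are hypothetical names, not any product's schema.

```python
import re

# Illustrative patterns only; real scanners use much richer detection.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def label_text(text):
    """Return the sorted PII categories whose pattern matches the source text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))

def make_record(record_id, text, vector, scan=True):
    """Build a vector record, copying text-derived labels forward when scanning is wired in."""
    record = {"id": record_id, "embedding": vector}
    labels = label_text(text) if scan else []
    if labels:  # assumption: labels are attached only when something is found
        record["pii_labels"] = labels
    return record

labeled = make_record("doc-1", "Reach me at ana@example.com", [0.1, 0.2])
clean = make_record("doc-2", "Lunch is at noon", [0.3, 0.4])
# Same pipeline, but the scanning step was never added.
skipped = make_record("doc-3", "SSN 000-12-3456 on file", [0.5, 0.6], scan=False)

print(labeled.get("pii_labels"))  # ['email']
print(clean.get("pii_labels"))    # None: scanned, nothing found
print(skipped.get("pii_labels"))  # None: never scanned, yet sensitive
```

The last two records are indistinguishable from the outside, which is exactly the situation where scanning the vector itself is the only option left.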

If you use RAG, you almost certainly have sensitive data duplicated into vectors that nobody is watching.

What if you could scan the vectors themselves?

If everyone’s data is being copied into these innocuous-looking numerical representations, which are actually quite dangerous if stolen, then how do we help people figure out whether they even have a problem?

The answer was to build something that empowers security and privacy teams to quantify the problem. Anyone can scan suspicious vectors and learn whether there’s PII and other sensitive data hiding in them.

And so today we’re announcing the release of a tool that does exactly that: it’s called VectorLens.

The VectorLens tool

VectorLens is a command-line tool built for security teams, privacy and GRC folks, and the developers who actually own these pipelines. It runs as a single self-contained binary on macOS and Linux, and the workflow is three steps:

1. Export the embedding vectors you want to scan from your vector database. Our docs have copy-paste snippets for Postgres/pgvector, Pinecone, Weaviate, Milvus, Qdrant, and others. If you can export to JSONL or Parquet, VectorLens can scan it.
2. Scan them by pointing the tool at your export and specifying which embedding model produced them.
3. Report on what it found: get a machine-readable JSON report (easily processed by scripts in your pipelines), the inline text output, or a polished PDF you can forward to your boss, your CISO, or your chief data officer.
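For the export step, the shape of the work is roughly the following. Treat it as a sketch under assumptions: the field names `id` and `embedding` are placeholders we picked for illustration, so check the docs for the layout VectorLens actually expects, and swap the hard-coded `rows` for whatever your database client returns.

```python
import json
import os
import tempfile

def write_jsonl(rows, path):
    """Write (id, vector) pairs to a JSONL file, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record_id, vector in rows:
            # "id" and "embedding" are assumed field names, not a documented schema.
            line = {"id": str(record_id), "embedding": [float(x) for x in vector]}
            handle.write(json.dumps(line) + "\n")
            count += 1
    return count

# Stand-in for rows fetched from a vector database.
rows = [
    ("doc-1", [0.12, -0.03, 0.57]),
    ("doc-2", [0.40, 0.22, -0.31]),
]

path = os.path.join(tempfile.mkdtemp(), "embeddings.jsonl")
written = write_jsonl(rows, path)
print(f"wrote {written} vectors to {path}")
```

Streaming one record per line keeps memory flat, which matters once the export runs to millions of vectors.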

Here’s roughly what a run looks like:


$ ironcore-vector-lens scan -m all-minilm-l6-v2 jsonl-file --path embeddings.jsonl --report-path report.pdf
11:14:20 No cached lease found, fetching one from the license server
11:14:21 Found supported model 'all-minilm-l6-v2', scanning for PII with it.
11:14:28 Detected 19065 PII embeddings, sampling a few:
11:14:28 Detected ai4p-6609-0 as containing address, email, name, numeric_identifier, phone_number PII.
11:14:28 Detected ai4p-8291-0 as containing address, email, name, phone_number PII.
...
11:14:29 Scan report written to ./ironcore_pii_audit_embeddings.json.
11:14:29
╭────────────────────────┬───────╮
│ total_embeddings       │ 35033 │
├────────────────────────┼───────┤
│ total_pii_embeddings   │ 19065 │
│ address                │ 161   │
│ credit_card_number     │ 12    │
│ email                  │ 5802  │
│ name                   │ 10812 │
│ numeric_identifier     │ 204   │
│ phone_number           │ 850   │
│ social_security_number │ 9     │
│ unspecified            │ 6909  │
╰────────────────────────┴───────╯
11:14:29 54.42% of the scanned embeddings contained PII

And here’s the PDF report from that same run:

[Image: a VectorLens PDF report summarizing the PII discovered across a set of scanned embeddings]
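If you want to act on those numbers in a script, the arithmetic is simple. The sketch below copies the counts from the sample run's summary table into a dict. That dict layout is our assumption for illustration and may not match the real JSON report's structure, and the threshold check is a hypothetical policy, not a built-in feature.

```python
# Counts copied from the sample run's summary table; the dict layout is assumed.
summary = {
    "total_embeddings": 35033,
    "total_pii_embeddings": 19065,
    "categories": {
        "address": 161,
        "credit_card_number": 12,
        "email": 5802,
        "name": 10812,
        "numeric_identifier": 204,
        "phone_number": 850,
        "social_security_number": 9,
        "unspecified": 6909,
    },
}

def pii_percentage(report):
    """Share of scanned embeddings flagged as containing PII, as a percentage."""
    return round(100 * report["total_pii_embeddings"] / report["total_embeddings"], 2)

def exceeds_threshold(report, max_percent):
    """Hypothetical pipeline gate: True when the PII share is above the allowed maximum."""
    return pii_percentage(report) > max_percent

print(pii_percentage(summary))             # 54.42
print(sum(summary["categories"].values())) # 24759
print(exceeds_threshold(summary, 5.0))     # True
```

The category counts add up to more than the 19065 flagged embeddings, presumably because a single embedding can carry several kinds of PII at once, as the sampled log lines show.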

Once you have a report, you can take action to protect, monitor, and govern vectors:

- Add a text-first scanner to your pipeline wherever you identified gaps, if desired.
- Schedule regular VectorLens scans to catch what’s being missed or to augment your DSPM.
- Encrypt vectors that hold sensitive information, or encrypt all vectors to be safe.
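For the scheduled-scan option, a thin wrapper is enough to get started. This sketch uses only the subcommand and flags that appear in the sample run above and does not assume any others. Treating a nonzero exit code as failure, and triggering the script from cron or a CI job, are our assumptions.

```python
import shlex
import subprocess

def build_scan_command(model, jsonl_path, report_path):
    """Assemble the argument list for a scan, mirroring the flags in the sample run."""
    return [
        "ironcore-vector-lens", "scan",
        "-m", model,
        "jsonl-file",
        "--path", jsonl_path,
        "--report-path", report_path,
    ]

def run_scan(model, jsonl_path, report_path):
    """Run the scan and return its exit code (assumes the binary is on PATH)."""
    command = build_scan_command(model, jsonl_path, report_path)
    completed = subprocess.run(command, check=False)
    return completed.returncode

# Print the command instead of running it, so this works without the binary installed.
command = build_scan_command("all-minilm-l6-v2", "embeddings.jsonl", "report.pdf")
print(shlex.join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file paths contain spaces.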

The point of VectorLens is to turn an abstract risk into data you can actually do something about.

The key difference from every other DSPM tool out there: you don’t need the source data.

It runs entirely on your machine

We made VectorLens free to try, and fully self-serve. You sign up for a license key, which requires an email, but not a credit card. We immediately email you the license key, then you download the binary and run it.

Everything runs locally. IronCore never sees your vectors or anything else private. The only network traffic is a lightweight call home to validate your license, followed by a call to report high-level usage metrics such as how many times the tool has run and how many vectors it has scanned. That’s it.

On Apple Silicon, it uses the built-in Metal GPU; on Linux there are CPU-only builds plus optional CUDA builds if you’ve got NVIDIA hardware and a lot of vectors to get through.

This is an initial release, and we want your feedback

VectorLens supports a specific set of embedding models at launch (all-minilm-l6-v2, bge-m3, gtr-t5-base, text-embedding-ada-002, and text-embedding-3-large), and it’s focused on English for now. We’re adding models regularly, and we are very interested to hear from people about what they want next: which models, which languages, which capabilities would make it better for your use cases.

So please download it, point it at your vectors, and tell us what you think. It’s free to try, and it just might shine a light on some shadow data.

The first step to fixing a shadow-data problem is being able to see it.

Want the deeper background on how these attacks actually work? Start with “‘Embeddings Aren’t Human Readable’ and Other Nonsense” and our annotated DEF CON 33 talk, “Illuminating the Dark Corners of AI.” When you’re ready to do something about it, “Cloaked AI Getting Started” can help you encrypt your vectors wherever they live.