2025-12-15

Patrick Walsh

One Unchecked Box, One Billion Records: The Human Error Problem

The misconfiguration epidemic that training can’t fix

Here’s a stat that should terrify every CISO: according to IBM’s 2025 Cost of a Data Breach Report, human error now accounts for 26% of all data breaches, up from 22-24% just a year ago. The Verizon DBIR 2025 puts it even higher, reporting that the human element was involved in 60% of breaches. And Mimecast puts the number at 95%.

We spend billions on security awareness training, automated scanning tools, and compliance frameworks. Yet somehow, the “oops” factor keeps getting worse.

Nowhere is this more visible than in the graveyard of exposed Elasticsearch and OpenSearch instances. Since 2022, misconfigured search databases have leaked over 17 billion records¹, including data from national police databases and plaintext government passwords.

The pattern is almost comically consistent: deploy Elasticsearch without authentication, expose it to the internet, get indexed by Shodan within hours, and wake up to a billion-record breach.

The question isn’t whether your team will make mistakes. It’s whether those mistakes will be learning moments or catastrophes.

The human element isn’t getting better

If you think more training is the answer, I have bad news. Despite years of security awareness programs, the numbers are moving in the wrong direction.

Source	Key Finding	Impact
IBM 2025	Human error: 22% → 26%	Avg breach cost: $4.44M
Verizon DBIR 2025	Human element in 60% of breaches	Third-party breaches doubled to 30%
Ponemon 2025	Negligent insiders: 55%	$8.8M annual cost from negligence

The problem is structural, not individual. We’ve built security models where a single checkbox mistake, like forgetting to enable authentication, exposing the wrong port, or using default settings, can instantly compromise millions of records.

And the worst part? Many of these failures happen on systems that claim to be “secure by default.”

The Elasticsearch exposure epidemic

Elasticsearch and OpenSearch are phenomenal technologies for full-text search, log analysis, and data analytics. But their flexibility comes with a catch: security was historically optional.

Versions before 8.0 (released in 2022) shipped without authentication enabled by default. If you accidentally expose port 9200 (the Elasticsearch HTTP API) or port 5601 (Kibana) to the internet without proper firewall rules, you’re essentially putting up a neon sign that says “free data here.”

Tools like Shodan actively scan the internet for exposed services. An unsecured Elasticsearch instance can be discovered, indexed, and exfiltrated within hours of going online.
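You can run the same kind of check against your own cluster before a scanner does. The sketch below assumes that a cluster with security enabled rejects an unauthenticated request with HTTP 401 or 403, while an unsecured one answers with 200. The function names and the example hostname are hypothetical, and it should only ever be pointed at systems you own.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Interpret the HTTP status returned for an unauthenticated GET."""
    if status == 200:
        return "OPEN: answered without credentials"
    if status in (401, 403):
        return "protected: credentials required"
    return f"inconclusive: HTTP {status}"


def probe(base_url: str, timeout: float = 5.0) -> str:
    """Send one unauthenticated request to a cluster you own and classify it."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the status code is still the answer.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError) as err:
        return f"unreachable: {err}"


# Hypothetical usage against your own host:
# print(probe("http://search.internal.example:9200/"))
```

Run from outside your network, an "unreachable" result is the one you want for a cluster that was never meant to be public.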

Let me show you what that looks like in practice.

Case study 1: Shanghai Police; 1 billion citizen records

In July 2022, a hacker known as “ChinaDan” posted a sales listing on a dark web forum: 1 billion records from the Shanghai National Police Database for 10 Bitcoin (about US$200k at the time).

The data included full names, national ID numbers, birthdays, police case records, and home addresses. Roughly 70% of China’s population was in this database.

The cause? An unprotected Kibana dashboard running on an outdated Elasticsearch 5.5.3 instance. No password. No firewall. Just open to the public internet.

The developer who configured it likely didn’t realize Elasticsearch security wasn’t enabled by default in version 5.x. One deployment oversight created what may be the largest police data breach in history.

Case study 2: DarkBeam; 3.8 billion credentials exposed

In September 2023, security researcher Bob Diachenko discovered an exposed Elasticsearch cluster belonging to DarkBeam, a digital risk protection company. The database contained 3.8 billion records, including email and password combinations collected from previous data breaches.

This wasn’t stolen user data; DarkBeam legitimately aggregates breach data for monitoring services. But because their Elasticsearch instance and Kibana interface were left open without authentication, anyone could access the entire dataset, including credentials from government agencies and Fortune 500 companies.

Diachenko noted that DarkBeam fixed the issue quickly once notified, but Shodan had already indexed it.

A company whose business model is “protecting you from data breaches” exposed billions of credentials because they misconfigured their own infrastructure. That irony should be studied in textbooks.

Case study 3: The 2025 mega-compilation; 6.19 billion records

In October 2025, researchers discovered what may be the largest single data exposure ever: 6.19 billion records in a misconfigured Elasticsearch instance with no authentication. The database appeared to be a compilation of multiple breaches, containing PII, bank account data, passport numbers, and government IDs from multiple countries.

Same story, bigger numbers: No password, default ports exposed, and Shodan found it first.

Case study 4: Microsoft; 250 million support records

The last case comes from Microsoft, proving that even companies with huge security budgets can run into these problems. The company left five Elasticsearch servers containing 250 million customer support records exposed to the public internet. According to SecurityWeek, the records included “email addresses, IP addresses, locations, descriptions of CSS claims and cases, Microsoft support agent emails, internal notes marked as ‘confidential,’ and case numbers, resolutions, and remarks.”

But what makes the Microsoft misconfiguration notable (and still relevant, even though it dates from 2020) is the statement put out by the Microsoft Security Response Center:

“Misconfigurations are, unfortunately, a common error across the industry. We have solutions to help prevent this kind of mistake, but unfortunately, they were not enabled for this database.”

In other words, cloud systems are difficult to get right, and the systems that watch them are no less difficult or error-prone.

Why this keeps happening

If you’re thinking, “How do professionals keep making the same mistake?” you’re asking the right question.

Modern infrastructure is absurdly complex. No one can hold everything in their mind and reason it all out. Worse, security is often opt-in rather than default. Here’s the typical sequence:

1. Developer spins up Elasticsearch for testing (security disabled by default)
2. Testing instance gets promoted to production with the same config
3. Port 9200 ends up internet-accessible through misconfigured firewall rules
4. No alerts fire because the misconfiguration isn’t in the scanner’s scope
5. Shodan indexes the instance within 24-48 hours
6. Attackers find it and exfiltrate data in bulk
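The first two steps of that sequence can be caught before deployment with a check on the config itself. Here is a minimal sketch, assuming the settings from elasticsearch.yml have already been loaded into a flat dictionary. `xpack.security.enabled` and `network.host` are real Elasticsearch settings, but the `audit_settings` function and its rules are illustrative, not a complete audit.

```python
# Bind addresses that mean "listen on every interface".
WILDCARD_HOSTS = {"0.0.0.0", "::"}


def audit_settings(settings: dict) -> list[str]:
    """Return warnings for settings that commonly lead to exposed clusters."""
    findings = []

    security = settings.get("xpack.security.enabled")
    if security is None:
        # Pre-8.0 releases default to no authentication, so unset is not safe.
        findings.append("xpack.security.enabled is unset: default depends on version")
    elif str(security).lower() != "true":
        findings.append("xpack.security.enabled is not true: no authentication")

    host = str(settings.get("network.host", ""))
    if host in WILDCARD_HOSTS:
        findings.append(f"network.host is {host}: listening on every interface")

    return findings


# A test config that should never be promoted to production unchanged.
for warning in audit_settings({"xpack.security.enabled": False, "network.host": "0.0.0.0"}):
    print(warning)
```

A check like this only helps if it runs in the promotion path. As step 4 shows, a scanner that never looks at the instance can’t flag it.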

Security by default should be the norm, not a feature you enable later.

AI to the rescue?

At this point, you might be wondering: if humans keep making these mistakes despite training, tools, and best practices, why not remove them from the equation entirely? Why not use AI to do these tasks?

This sounds good, but it’s wishful thinking. Agentic AI uses probabilistic and non-deterministic models, which is a fancy way of saying they don’t do the same thing every time. Given the same task 10 times in a row, AI is likely to mess it up at least once.
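To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each run succeeds independently with the same probability, which is a simplification, and the success rates are hypothetical inputs rather than measurements of any real model.

```python
def chance_of_any_failure(per_task_success: float, tasks: int) -> float:
    """Probability of at least one failure across independent, identical tasks."""
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    # P(at least one failure) = 1 - P(every task succeeds)
    return 1.0 - per_task_success ** tasks


for rate in (0.90, 0.95, 0.99):
    print(f"{rate:.0%} per-task success, 10 tasks: "
          f"{chance_of_any_failure(rate, 10):.1%} chance of at least one failure")
```

Under these assumptions, an agent that gets a task right 95% of the time still fails at least once in roughly 40% of ten-task batches, and at 90% the odds pass 65%.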

Perhaps in the future we’ll consider AI reliable, free of hallucinations, immune to prompt injection, and so forth, but until then, humans are still the best option we have. So if we can’t eliminate human error and we can’t automate it away, what’s left? The answer: design systems where mistakes don’t matter.

Making mistakes survivable

Of course, you should enable authentication on Elasticsearch, properly configure your firewalls, and train your team.

But history shows that you won’t do it perfectly, every time, forever. No one does.

The real question is: how do we design systems that prevent inevitable mistakes from resulting in catastrophic data breaches?

The answer is application-layer encryption, where you encrypt data before sending it to a storage layer or search service. That’s what Cloaked Search does for Elasticsearch and OpenSearch.

Cloaked Search sits as a transparent proxy between your application and your search cluster. It encrypts the sensitive fields you select, along with their associated indices, before forwarding anything to the cluster. Even admins can’t learn anything about the protected data in Elasticsearch/OpenSearch. And queries are encrypted so that the encryption-in-use scheme protects the data all the way back to your application.

Now, if a misconfiguration happens, the attackers only get encrypted gibberish. Even if they download the entire database. Without the encryption keys (which Elasticsearch never has), the data is useless.
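To make the idea concrete, here is a toy sketch of one general technique from searchable encryption: a keyed-hash “blind index,” where each word is replaced by an HMAC before it leaves the application. This is not Cloaked Search’s actual scheme, the names are hypothetical, and a real system would also encrypt the stored field value with an authenticated cipher from a vetted library, which is omitted here.

```python
import hashlib
import hmac
import secrets


def blind_tokens(text: str, key: bytes) -> list[str]:
    """Replace each word with a keyed hash so the index never sees plaintext."""
    return [
        hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()
        for word in text.lower().split()
    ]


# The key stays with the application and is never sent to the search cluster.
key = secrets.token_bytes(32)

# What would be stored in the index instead of the plaintext field.
stored = blind_tokens("Jane Doe Springfield", key)

# Queries are transformed with the same key, so exact-word matches still work.
query = blind_tokens("doe", key)[0]
print(query in stored)
```

Someone who dumps the index sees only hex digests, and without the key they can’t compute the token for a guessed word. Deterministic tokens like these still reveal which documents share a word, which is one reason production schemes are more involved than this sketch.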

You still get all major search functionality like phrase searches, wildcards, type-ahead, boolean queries, and more. For multi-tenant applications, you can encrypt per-tenant. Learn more in our documentation or try it yourself.

The better security model

The traditional security model assumes you can prevent breaches with enough diligence. Firewalls, access controls, monitoring, and training are layers intended to stop bad things from happening. But none of those consider the scenario where a breach does occur. What if an insider is malicious? What if a vulnerability is exploited? What if an internal service is exposed to the Internet?

Breaches are increasing, and human error is increasingly a main cause. The layered encryption model simplifies the problem: assume breaches are inevitable, and design systems where unauthorized system access doesn’t result in data theft.

With encryption: “We had a misconfiguration. We fixed it. No data was exposed.” Without: “We need to notify 3.8 billion people.”

The next big Elasticsearch oopsie is coming soon. The only question is which outcome you’ve designed for.

If you’re running Elasticsearch or OpenSearch, talk to us about Cloaked Search. We’ll show you how encryption can work transparently in your stack without breaking functionality or requiring massive rewrites. And if you’re doing vector searches or hybrid searches, we have you covered for those, too.

¹ The case studies cited above come to over 11 billion records. Here’s a more thorough, though still incomplete, accounting that totals over 17 billion records since 2022:

Year	Incident	Records	Cause
2025	Massive Compilation Leak	6.19 billion	Misconfigured ES, no auth
2025	Chinese Surveillance DB	4 billion	Unprotected ES instance
2025	Credentials Database	184 million	Unsecured ES, no encryption
2025	Swedish Citizens Leak	100 million	No auth, no firewall
2024	Catholic Health/Serviceaide	483,126	ES accessible without auth
2024	GS-JJ Military Data	300,000	Unsecured ES instance
2023	DarkBeam	3.8 billion	Misconfigured ES + Kibana
2023	Kid Security App	300 million	No ES/Logstash auth
2022	Shanghai Police	1 billion	Unprotected Kibana on ES 5.5.3
2022	StoreHub Malaysia	1.7 billion	Misconfigured ES on AWS
2022	Thomson Reuters	3+ TB	Public ES without password
2022	ENCollect Debt Collection	1.68 million	Open ES server
	TOTAL	more than 17 billion
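The total row can be checked with a few lines of arithmetic. The sketch below uses the record counts exactly as listed in the table. Thomson Reuters is left out because the table gives a data volume for it rather than a record count.

```python
# Record counts as listed in the table above (Thomson Reuters excluded: size only).
RECORDS = {
    "Massive Compilation Leak": 6_190_000_000,
    "Chinese Surveillance DB": 4_000_000_000,
    "Credentials Database": 184_000_000,
    "Swedish Citizens Leak": 100_000_000,
    "Catholic Health/Serviceaide": 483_126,
    "GS-JJ Military Data": 300_000,
    "DarkBeam": 3_800_000_000,
    "Kid Security App": 300_000_000,
    "Shanghai Police": 1_000_000_000,
    "StoreHub Malaysia": 1_700_000_000,
    "ENCollect Debt Collection": 1_680_000,
}

total = sum(RECORDS.values())
print(f"{total:,} records")  # 17,276,463,126 records
```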
