2021-10-31 Patrick Walsh
Solving Search Over Encrypted Data
Everyone needs strong data protection. Current “transparent” encryption isn’t cutting it. Every time we see a breach that steals people’s social security numbers, addresses, credit card numbers, and so forth, we have just one question for the company that got hacked:
Encrypting sensitive data before putting it into your database or into your document storage is the best way to protect it if you assume that a bad actor getting into your network is inevitable. Which you should, because it is. This approach to encrypting before storing is called application-layer encryption, and too few companies use it to protect their sensitive data.
In 2017, when Equifax was hacked, it was one of the biggest breaches in history. The attackers downloaded the birthdates, social security numbers, addresses, and other sensitive information on 146 million Americans. The congressional report on the breach said this about the attack:
The attackers were able to use these credentials to access 48 unrelated databases. Attackers sent 9,000 queries on these 48 databases, successfully locating unencrypted personally identifiable information (PII) data 265 times.
They stored sensitive consumer information in plaintext. Unprotected. Repeatedly.
And they’re not the only ones. T-Mobile, for example, was hacked earlier this year and the story was the same. So why does this keep happening? There are two answers:
- Our systems (consumer, legal, stock market, etc.) don’t adequately incentivize good security or sufficiently punish bad security.
- There is often a worry that encrypting data will turn it “dark” and make it unusable.
Today we’re going to fix problem number two.
Elasticsearch and its non-identical twin, the recently forked open source project OpenSearch, are the hands-down leaders in search software. Solr comes next. And all of them are underpinned by Lucene, which supplies most of the actual search functionality. These packages are packed with features for scalability, redundancy, customizable rankings of results, expressive queries, and so forth.
IronCore Labs is bringing an encrypted search capability to Elasticsearch and OpenSearch as a drop-in solution. We’re meeting customers where they are and helping them to secure more data without big projects. With IronCore Labs, encrypted search augments standard search capabilities and existing search clusters and clients. Users retain the performance, resiliency, scalability, and power they already enjoy, but they get a huge boost in the security of the data they hold.
IronCore Labs deploys as a proxy that sits in front of the search service. Developers point their search clients to the proxy instead of the underlying service, and everything continues to work much the same as it did before. Ops configures which fields should be encrypted in which indices. When indexing or searching documents, the proxy will automatically protect the configured fields in documents and queries as needed.
The scheme we use is a form of searchable symmetric encryption sometimes called a “blind index.” Essentially, a word is turned into a token using a secure hash function with a key (an HMAC). So a search for the word “report” might get transformed by the proxy into a search for the word “38A6F105” or similar. We also encrypt the source with a per-document random AES-256-GCM key, so we can return the original text when requested.
This approach allows us to drop into existing infrastructure and be readily deployable. Because a blind index can be susceptible to frequency analysis, we do a number of things to strengthen the implementation and to make it harder for someone to glean useful information from the index. Details can be found in our documentation.
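To make the blind index idea concrete, here is a minimal sketch in Python. It is not IronCore Labs' implementation: the key value, the lowercase normalization, the simple word-splitting regex, and the truncation to 8 hex characters are all assumptions chosen for illustration. It also leaves out the per-document AES-256-GCM encryption of the source, as well as any of the hardening against frequency analysis mentioned above.

```python
import hashlib
import hmac
import re
from typing import List


def blind_token(key: bytes, word: str, length: int = 8) -> str:
    """Turn a word into an opaque token with HMAC-SHA256.

    The digest is truncated to `length` hex characters purely to keep
    the example readable (an assumption, not a recommendation).
    """
    normalized = word.lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return digest[:length].upper()


def tokenize(key: bytes, text: str) -> List[str]:
    """Split text into words and replace each word with its blind token."""
    return [blind_token(key, w) for w in re.findall(r"\w+", text)]


# Hypothetical key for illustration only; a real key would come from a
# key management system and never be hard-coded.
index_key = b"example-index-key-not-for-production"

# What a proxy might send to the search service at index time...
indexed_tokens = tokenize(index_key, "Quarterly report for the board")

# ...and at query time. The same key and word give the same token.
query_tokens = tokenize(index_key, "REPORT")

# The search service only compares tokens; it never sees "report".
print(query_tokens[0] in indexed_tokens)
```

Because the token for a word depends on the key, someone who sees only the index cannot simply hash a dictionary of words and compare. The same word always maps to the same token under a given key, though, which is exactly why frequency analysis is a concern.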
When it comes to multi-tenant SaaS applications, one of the biggest problems is leakage of data between tenants (customers). It’s cheaper to use the same cluster or even the same index to house the data of multiple customers, but this increases the risk of data bleeding out to other tenants. A search injection attack, for instance, can end up returning far more data than it should.
IronCore Labs' encrypted search proxy hooks into our SaaS Shield product to offer per-tenant encryption with optional customer-held encryption keys. That means each tenant’s data is encrypted with its own key. And we only ever use a single key per request, which ensures that one query can only ever return data for one customer (tenant).
And maybe more importantly, customers have the option to control their own sensitive data because it’s encrypted with keys they control. If they hold their own keys, they get audit trails on access, and they can withhold access to their key if they suspect any abuse. Without the key, their data is unusable.
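The tenant isolation described above can be illustrated with a small Python sketch. Everything here is hypothetical: the tenant IDs, the hard-coded keys, the in-memory "shared index," and the `search` helper are stand-ins for illustration, not the SaaS Shield API. The point is only that when each tenant's tokens are derived with a different key, a query built with one tenant's key has nothing to match against in another tenant's data.

```python
import hashlib
import hmac
from typing import List

# Hypothetical per-tenant keys; in practice these would be fetched from
# a key management service (possibly one the customer controls).
TENANT_KEYS = {
    "tenant-a": b"key-for-tenant-a",
    "tenant-b": b"key-for-tenant-b",
}


def blind_token(tenant_id: str, word: str) -> str:
    """Derive a search token using only the given tenant's key."""
    key = TENANT_KEYS[tenant_id]
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


# A single shared index holding (document id, token) pairs for both tenants.
# Both documents contain the same word, but the stored tokens differ.
shared_index = {
    ("doc-1", blind_token("tenant-a", "invoice")),
    ("doc-2", blind_token("tenant-b", "invoice")),
}


def search(tenant_id: str, word: str) -> List[str]:
    """Look up a word using exactly one tenant's key per request."""
    token = blind_token(tenant_id, word)
    return sorted(doc for doc, stored in shared_index if stored == token)


print(search("tenant-a", "invoice"))
print(search("tenant-b", "invoice"))
```

Each search returns only the document that belongs to the tenant whose key built the query, even though both documents sit in the same index and contain the same word. If a tenant's key is withheld, no valid token for that tenant's data can be produced at all.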
There’s an extremely well-studied area of cryptography that’s focused on the problem of securely finding data: encrypted search. Academia has spawned thousands of papers with numerous approaches, attacks, and refinements.
The goal of an encrypted search system is to make the search service effectively blind to the data it holds. The search service accepts queries and returns matches, but in the process learns little or nothing about the contents of the queries and the contents of the indexed documents.
Because the data is not meaningful to the server holding it, compromising that server doesn’t jeopardize the data. Similarly, problems stemming from curious or malicious administrators are eliminated. SaaS companies that hold business data can also push law enforcement requests to their customers, since anything they could produce for the request would just be random, meaningless bytes without the key (and that key is protected and not exportable, right?).
Unfortunately, encrypted search schemes aren’t perfect. They often leak information about the data they protect, particularly if an attacker can observe queries and responses. There may also be patterns in the data, useful metadata that offers clues such as when a document was added or accessed, and more.
For example, if the search scheme uses deterministic encryption, where a particular word always encrypts to a specific value, then an observer could make probabilistic guesses that reverse the encrypted words back to their unencrypted counterparts. This is often done using a frequency analysis attack. If you’ve ever tried to solve a cryptogram using information about the most used characters, then you’ve done something similar.
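A toy version of that attack, sketched in Python, shows why determinism matters. The corpus, the key, and the attacker's "known" ranking of common words are all made up for illustration, and the plain HMAC tokens stand in for any deterministic scheme without countermeasures. The attacker never learns the key; they only count how often each token appears and line that up against how often words are expected to appear.

```python
import hashlib
import hmac
from collections import Counter

# Secret key the attacker does not have (hypothetical value).
key = b"secret-unknown-to-the-attacker"


def blind_token(word: str) -> str:
    """Deterministic token: the same word always yields the same value."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


# Toy corpus: "the" appears 3 times, "and" twice, everything else once.
corpus = "the cat and the dog and the bird".split()
tokens = [blind_token(w) for w in corpus]

# Attacker's view: only `tokens`, plus public knowledge that "the" and
# "and" are the two most common words in text like this.
expected_ranking = ["the", "and"]

# Rank the observed tokens by frequency and pair them with the ranking.
most_common_tokens = [tok for tok, _ in Counter(tokens).most_common(len(expected_ranking))]
guesses = dict(zip(most_common_tokens, expected_ranking))

for token, guess in guesses.items():
    print(token[:8], "is probably", repr(guess))
```

On a corpus this small the guesses are exact. On real data they are probabilistic, and they get better the more tokens, queries, and responses the observer can collect.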
Cryptographers have proposed all sorts of ways to counter these attacks. Some of the tricks are practical and some come with steep trade-offs in performance, complexity, and usability.
As we evaluated different approaches, we prioritized usability. This means we wanted to preserve the ability to run searches that include multiple disconnected terms, boolean operators, wildcards, and so forth. Being able to rank results by relevance was also critical.
The world has been treading water in the status quo of unsecured data for years. It’s time to move forward.
Ultimately, your data is most secure if it’s encrypted and it can’t be searched at all. But that thinking has led to the opposite happening: data not being encrypted. And that’s much, much worse. The industry has been left paralyzed and without practical, adoptable solutions.