1. Docs

What is Encrypted Search?

In short, the ability to search over encrypted documents to find the encrypted data you want.

More specifically: it’s a way to protect data that has been indexed for search. Neither the search service, its admins, nor hackers with access to the service can extract information about what is stored or about incoming searches without expending an extreme amount of effort. Everything is meaningless without the right key(s).

Landscape of Approaches

Encrypted Search is a deeply studied area of cryptography with thousands of papers, approaches, and attacks. The approaches range from deterministic encryption where a word is always encrypted to the same value to more random approaches. Some of the approaches hide how many results link to a specific token.

Many of the approaches – some of which are the most theoretically secure – suffer from performance and usability issues. For example, they may not support multi-term queries, boolean searches, or they may require many extra disk lookups.

“Leakage”

Encrypted search schemes tend to suffer from “leakage” of varying degrees. Leakage is where an attacker with access to the data store or with visibility into the flow of encrypted queries and results can learn things about the data, such as which searches are most popular or which words show up most often in the data. Learnings like this are called leakage and they may allow an attacker to at least partially reverse an index.

Existing search services like Elasticsearch and OpenSearch are very good at what they do. They’re efficient, they scale, and they have a zillion knobs that handle sharding, redundancy, and every manner of search query you’ve ever dreamed of. And these search services are in widespread use.

One of our goals is to make better security as adoptable as possible with little or no work by the adopter. We want to maximize usability and level-up security.

For these reasons, we chose a form of Searchable Symmetric Encryption known as a Blind Index. It’s fast, efficient, can be used to construct complex search queries, and most importantly, it can work with an existing search service. That means someone already using the search service doesn’t have to change their code or take a leap of faith on a new search service to get started. They just need to connect to the search service through a proxy.

As with all deterministic encryption approaches, a Blind Index is subject to some leakage. We understand this trade-off and we have also worked hard to minimize it. To that end, we’ve added a number of features that frustrate this sort of attack.

Anatomy of a Blind Index

Search Index Terminology: an index contains many documents, which each contain many fields, which each contain many words. These words get mapped one-to-one to encrypted tokens.

A blind index is straightforward to create and use. A piece of data (a document) to be indexed is prepared as it would be for a typical search index, by extracting a set of tokens from each field in the document. The indexer then uses a secret key to create a deterministic hash for each token, and it stores these hashes in its document index instead of the words themselves. A search operation similarly tokenizes the query and generates hashes using the same secret key, and it then looks for the hashes instead of the tokens in its index. Let’s take a look at an example:

Say we had a name in our dataset, such as “Jim Bridger”. Instead of storing those two tokens, we could hash them using a secret key only known by the calling application to produce two hashes, like 570B5079 9DB219AB2. If we store those in the index, it is practically impossible for an attacker to determine what the name is. When we want to search, the calling application takes the search input, tokenizes it and hashes the tokens in the same way. So a search on “Jim” would become a search for 570b5079, for example.

Blind Index document example

The security of this scheme relies on a couple of cryptographic guarantees. One is that a sufficiently strong hashing algorithm is used, so that it is computationally infeasible to recover the original token from a hash. The other is that a keyed hashing algorithm is used (HMAC). This is a security measure to make it more difficult to precompute hashes from common tokens.

Keep Reading

Technical consult

Building systems with secure data can be challenging. We are experts in this area and are happy to talk with you to help. Fill out a form or jump into our community to talk with us.

Was this page helpful?