What is Encrypted Search?

In short, the ability to search over encrypted documents to find the encrypted data you want.
More specifically: it’s a way to protect data that has been indexed for search. Neither the search service, its admins, nor hackers with access to the service can extract information about what is stored or about incoming searches without expending an extreme amount of effort. Everything is meaningless without the right key(s).

Landscape of Approaches

Encrypted Search is a deeply studied area of cryptography with thousands of papers, approaches, and attacks. The approaches range from deterministic encryption, where a word is always encrypted to the same value, to schemes that use more randomness. Some of the approaches hide how many results link to a specific token.
Many of the approaches — including some of the most theoretically secure — suffer from performance and usability issues. For example, they may not support multi-term or boolean queries, or they may require many extra disk lookups.

“Leakage”

Encrypted search schemes tend to suffer from “leakage” of varying degrees. Leakage occurs when an attacker with access to the data store, or with visibility into the flow of encrypted queries and results, can learn things about the data, such as which searches are most popular or which words show up most often in the documents. That leaked information may allow an attacker to at least partially reverse an index.

Our Approach

Existing search services like Elasticsearch are very good at what they do. They’re efficient, they scale, and they have a zillion knobs that handle sharding, redundancy, and every manner of search query you’ve ever dreamed of. And Elasticsearch is in widespread use.
One of our goals is to make better security as adoptable as possible with little or no work by the adopter. We want to maximize usability and level-up security.
For these reasons, we chose a form of Searchable Symmetric Encryption known as a Blind Index. It’s fast, efficient, can be used to construct complex search queries, and most importantly, it can work with an existing search service such as Elasticsearch. That means someone already using Elasticsearch doesn’t have to change their code or take a leap of faith on a new search service to get started. They just need to connect to the search service through a proxy.
As with all deterministic encryption approaches, a Blind Index is subject to some leakage. We understand this trade-off and we have also worked hard to minimize it. To that end, we’ve added a number of features that frustrate this sort of attack.

Anatomy of a Blind Index

Search Index Terminology: an index contains many documents, which each contain many fields, which each contain many words. These words get mapped one-to-one to encrypted tokens.
A blind index is straightforward to create and use. A piece of data (a document) to be indexed is prepared as it would be for a typical search index, by extracting a set of tokens from each field in the document. The indexer then uses a secret key to create a deterministic hash for each token, and it stores these hashes in its document index instead of the words themselves. A search operation similarly tokenizes the query and generates hashes using the same secret key, and it then looks for the hashes instead of the tokens in its index. Let’s take a look at an example:
Say we had a name in our dataset, such as “Jim Bridger”. Instead of storing those two tokens, we could hash them using a secret key known only to the calling application to produce two hashes, like 570B5079 and 9DB219AB. If we store those in the index, it is practically impossible for an attacker to determine what the name is. When we want to search, the calling application takes the search input, tokenizes it, and hashes the tokens in the same way. So a search on “Jim” would become a search for 570B5079, for example.
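To make that flow concrete, here is a minimal sketch in Python of the indexing and lookup steps, assuming HMAC-SHA-256 truncated to eight hex characters. The tokenizer, key, and truncation length are illustrative choices, not a description of our exact implementation.

```python
import hmac
import hashlib

def blind_index_tokens(text: str, key: bytes) -> list[str]:
    """Tokenize a field and replace each word with a keyed, truncated hash."""
    tokens = text.lower().split()  # naive whitespace tokenizer for illustration
    return [
        hmac.new(key, t.encode("utf-8"), hashlib.sha256).hexdigest()[:8]
        for t in tokens
    ]

# Hypothetical per-field secret known only to the calling application.
key = b"per-field-secret-key"

# Indexing: store the hashes instead of the words themselves.
indexed = blind_index_tokens("Jim Bridger", key)

# Searching: hash the query the same way and look up the resulting hashes.
query = blind_index_tokens("Jim", key)
assert query[0] in indexed  # matches without ever revealing "Jim" to the index
```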
Blind Index document example
The security of this scheme relies on a couple of cryptographic guarantees. One is that a sufficiently strong hashing algorithm is used, so that it is computationally infeasible to recover the original token from a hash. The other is that a keyed hashing algorithm (an HMAC) is used, which makes it much harder to precompute hashes of common tokens without knowing the key.

Security FAQ

Can the blind index leak information?

One possible way to extract information from the indexed documents is to perform frequency analysis on the hashed tokens in the index. If an attacker has some idea of the kind of data each document contains (language, subject matter, etc.), they can make some guesses about more common tokens, find more common hashes, and attempt to match them. Taking pairs of tokens (or longer groups) together can allow an attacker to establish matches between common phrases and groups of hashed tokens, improving their ability to extract information from the index.
Using a separate salt for each field reduces the frequency data available across the entire set of indexed documents. Depending on the use case, additional steps, such as those described below, can provide further protection against frequency analysis. Each of these mitigations can produce false positives and may affect the ordering of search results, but provides better security by reducing information leakage.
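As a rough sketch of the per-field separation idea, one could derive a distinct hashing key for each index and field from a master secret. The HMAC-based derivation and the names below are assumptions for illustration; a stored random salt per field or an HKDF would serve the same purpose.

```python
import hmac
import hashlib

def field_key(master_key: bytes, index_name: str, field_name: str) -> bytes:
    """Derive a distinct hashing key per index and field (illustrative)."""
    context = f"{index_name}/{field_name}".encode("utf-8")
    return hmac.new(master_key, context, hashlib.sha256).digest()

master = b"tenant-master-secret"
title_key = field_key(master, "people", "title")
body_key = field_key(master, "people", "body")

# The same word hashes to different values in different fields, so
# frequency counts cannot be pooled across the whole index.
assert title_key != body_key
```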

Can someone recover document data from the blind index?

Using frequency analysis, an attacker may be able to guess the likely value of one or more commonly occurring words, such as “the” (if stop words are not removed). With a likely candidate for a known word in hand, the attacker can brute-force the key space by hashing the known word under possible keys, looking for a match against the stored hash.
Because of this attack, the hashed tokens themselves should be considered sensitive, but leaking them does not necessarily compromise the data. Assuming a sufficiently large key, which we require, a brute force attack would require more computing power and time than is feasible today.
Yet even if the attack were successful and the key found, that key would only unlock one field for one index (and, in a multi-tenant system, for a single tenant), and the attacker would still need to generate a rainbow table using that key to extract the actual tokens from the hashes in the index.
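To illustrate why key size matters here, the following is a toy version of that brute-force loop. The key space is deliberately tiny so the example terminates; with a real 256-bit key there are 2^256 candidates, far beyond any practical search.

```python
import hmac
import hashlib

def try_keys(known_word: bytes, target_hash: str, candidate_keys):
    """Brute-force sketch: hash the guessed word under each candidate key."""
    for key in candidate_keys:
        if hmac.new(key, known_word, hashlib.sha256).hexdigest()[:8] == target_hash:
            return key
    return None

# A deliberately tiny, hypothetical key space so the loop finishes quickly.
toy_keys = [bytes([i]) * 32 for i in range(256)]
target = hmac.new(toy_keys[42], b"the", hashlib.sha256).hexdigest()[:8]

print(try_keys(b"the", target, toy_keys))  # recovers the toy key almost instantly
```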

How do you frustrate frequency analysis and other leak attacks?

  • Frequency Suppression: We remove duplicate words in a given field (or in configurable chunks of a given field), so if the word “security” shows up repeatedly, we reduce it to a single instance per chunk of the text. (A sketch combining several of these mitigations follows this list.)
  • Hash Truncation: We truncate the hashes so that collisions occur, which means an analysis cannot assume that a single hash corresponds to a single word.
  • Added Noise: We add a configurable amount of “noise” — fake tokens — into the stream.
  • Random Ordering: We shuffle the order of words (per chunk of text, which is configurable in size) so knowing something about the document or expected words at specific positions does not directly leak data.
  • Multiple Keys: We use per-index and per-field (and, if applicable, per-tenant) keys so, for example, the hash for WordA in the “title” field is different from the hash for WordA in the “body” field.
  • Stop Words: We optionally exclude indexing of very common words.
  • Phonetic Normalization: We will offer optional phonetic normalization for English so that common misspellings of words and names are considered the same.
  • Phrase and Substring Matches: For fields with phrase and/or substring matching enabled, many more tokens are created, and they are intermingled with the single word tokens, making frequency analysis much more difficult.
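Here is a rough sketch of how several of these mitigations could be combined in one tokenization pass. The specific parameters (eight hex characters of hash, two fake tokens) are hypothetical and only meant to show the shape of the transforms.

```python
import hmac
import hashlib
import random
import secrets

def protected_tokens(words, key: bytes, noise: int = 2) -> list[str]:
    """Combine frequency suppression, hash truncation, added noise, and
    random ordering in one pass. Parameters are illustrative only."""
    unique = {w.lower() for w in words}                                    # frequency suppression
    hashes = [
        hmac.new(key, w.encode("utf-8"), hashlib.sha256).hexdigest()[:8]  # hash truncation
        for w in unique
    ]
    hashes += [secrets.token_hex(4) for _ in range(noise)]                 # added noise (fake tokens)
    random.shuffle(hashes)                                                 # random ordering
    return hashes

print(protected_tokens(["security", "matters", "security"], b"hypothetical-field-key"))
```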

Does that impact matches or result rankings?

Matches are generally the same, but rankings can change, especially since we remove or reduce the information on the number of times a word shows up in a document and on the proximity of different words. We provide configuration options to tune for security or relevancy as desired, but our default settings provide generally good results that are close to the results without encryption.

Can we only do exact matching?

Yes and no. The simple construction above only allows exact matching, but this construction can be extended to compute the hashes over substrings, prefixes, etc. This enables many more types of searches. The catch is that the desired options must be known at index time, because the tokens must be produced and stored. These techniques also increase the amount of data stored in the index for each document.
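As an illustration of the prefix case, a field configured for prefix matching might generate extra tokens like the ones below at index time. The minimum prefix length is a hypothetical, configurable parameter; each generated token would then be hashed with the field key just like an exact-match token.

```python
def prefix_tokens(word: str, min_len: int = 3) -> list[str]:
    """Generate prefix tokens so a query like "bri" can match "bridger"."""
    return [word[:i] for i in range(min_len, len(word) + 1)]

# Each prefix is hashed with the field key and stored alongside the
# exact-match token at index time, growing the index for that document.
print(prefix_tokens("bridger"))  # ['bri', 'brid', 'bridg', 'bridge', 'bridger']
```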

Where is the actual text stored?

Since the values have been swapped out for hashes, and since the document presumably contains sensitive data, the full text is encrypted and stored as well. We use a random key and AES-256 GCM to encrypt each document. The key is wrapped using another tenant-specific secret key and stored along with the document. This allows us to fully recreate fields from documents that have matched a search query.
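A minimal sketch of that envelope pattern using Python's cryptography package follows; the AES key wrap (AES-KW) step and the record layout are illustrative assumptions rather than a description of our exact format.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

tenant_key = AESGCM.generate_key(bit_length=256)  # tenant-specific secret (assumed)

def encrypt_document(plaintext: bytes, tenant_key: bytes) -> dict:
    """Encrypt a document with a fresh random key, then wrap that key with
    the tenant key so it can be stored next to the ciphertext."""
    doc_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(doc_key).encrypt(nonce, plaintext, None)
    return {
        "nonce": nonce,
        "ciphertext": ciphertext,
        "wrapped_key": aes_key_wrap(tenant_key, doc_key),
    }

def decrypt_document(record: dict, tenant_key: bytes) -> bytes:
    doc_key = aes_key_unwrap(tenant_key, record["wrapped_key"])
    return AESGCM(doc_key).decrypt(record["nonce"], record["ciphertext"], None)

record = encrypt_document(b"Jim Bridger, mountain man", tenant_key)
assert decrypt_document(record, tenant_key) == b"Jim Bridger, mountain man"
```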

Can we change tokenization methods?

This can be done, but it requires that each document be decrypted and re-indexed, which can be time-consuming.