Can the blind index leak information?
One possible way to extract information from the indexed documents is to perform frequency analysis on the hashed tokens in the index. If an attacker has some idea of the kind of data each document contains (language, subject matter, etc.), they can make some guesses about more common tokens, find more common hashes, and attempt to match them. Taking pairs of tokens (or longer groups) together can allow an attacker to establish matches between common phrases and groups of hashed tokens, improving their ability to extract information from the index.
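As a rough illustration of the attack described above, the sketch below counts hashed tokens in a toy index and guesses that the most frequent hash is the word "the". It assumes HMAC-SHA256 as the keyed hash and an index that stores every occurrence of every word. The key, documents, and the `blind_token` helper are hypothetical and used only for this example.

```python
import hashlib
import hmac
from collections import Counter


def blind_token(key: bytes, word: str) -> str:
    """Keyed hash of a lowercased word (HMAC-SHA256 is an assumption here)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical index: every occurrence of every word is hashed and stored.
key = b"demo-key-not-for-production"
documents = [
    "the patient reported the pain in the morning",
    "the doctor reviewed the chart",
    "pain returned in the evening",
]
index = [[blind_token(key, w) for w in doc.split()] for doc in documents]

# The attacker sees only hashes, but can still count how often each appears.
counts = Counter(tok for doc in index for tok in doc)
most_common_hash, occurrences = counts.most_common(1)[0]

# The attacker's guess: the most frequent hash is the most frequent English word.
# The key is used here only to confirm the guess for the demo; a real attacker
# would not have it.
guess_is_right = most_common_hash == blind_token(key, "the")
```

In this toy data the guess is correct, because nothing has been done to hide how often each token occurs.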
Using a separate salt for each field limits the frequency data available to an attacker, because counts cannot be combined across fields. Depending on the use case, further steps can add protection against frequency analysis. Each of these steps can produce false positives and may affect the ordering of search results, but improves security by reducing information leakage.
Can someone recover document data from the blind index?
Using frequency analysis, an attacker may be able to guess the likely value of one or more commonly occurring words, such as “the” (if stop words are off). With a likely candidate for a known word in hand, the attacker can attempt a brute-force search of the key space: hash the known word under each possible key and look for a match against the observed hash.
Because of this attack, the tokens themselves should be considered sensitive, but leaking them does not necessarily compromise the data. Assuming a sufficiently large key, which we require, a brute force attack would require computing power and time that is infeasible today.
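To give a sense of scale, the back-of-the-envelope estimate below computes how long an exhaustive key search would take. Both numbers are assumptions chosen for illustration: a 256-bit key (the text only says "sufficiently large") and an attacker testing 10^12 keys per second.

```python
KEY_BITS = 256                  # assumed key size
GUESSES_PER_SECOND = 10**12     # assumed attacker speed
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_years(key_bits: int, rate: float) -> float:
    """Average time to find a key: half the key space must be searched."""
    return (2 ** (key_bits - 1)) / rate / SECONDS_PER_YEAR


years = expected_years(KEY_BITS, GUESSES_PER_SECOND)
print(f"{years:.2e} years")
```

Under these assumptions the result is dozens of orders of magnitude larger than the age of the universe, which is why the practical risk lies in key handling rather than brute force.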
Even if the attack succeeded and the key were found, that key would only unlock one field of one index (and, in a multi-tenant system, for one tenant). The attacker would still need to generate a rainbow table with that key to map the hashes in the index back to the actual tokens.
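The limited reach of a single recovered key can be sketched as follows. The example assumes HMAC-SHA256 as the keyed hash; the keys, the word list, and the `build_lookup` helper are hypothetical. A lookup table built with one field's key reverses hashes from that field only.

```python
import hashlib
import hmac


def blind_token(key: bytes, word: str) -> str:
    """Keyed hash of a lowercased word (HMAC-SHA256 is an assumption here)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_lookup(key: bytes, candidate_words: list[str]) -> dict[str, str]:
    """Precompute hash -> word for every candidate under one known key."""
    return {blind_token(key, w): w for w in candidate_words}


# Hypothetical keys: the attacker has recovered only the first one.
title_key = b"recovered-title-key"
body_key = b"different-body-key"

wordlist = ["alpha", "security", "invoice"]
lookup = build_lookup(title_key, wordlist)

title_hash = blind_token(title_key, "security")
body_hash = blind_token(body_key, "security")

recovered = lookup.get(title_hash)  # the table covers this field
missed = lookup.get(body_hash)      # same word, other key: no entry
```

The table also only covers words the attacker thought to include, so rare names or identifiers stay hidden unless they appear in the candidate list.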
How do you frustrate frequency analysis and other leak attacks?
- Frequency Suppression: We remove duplicate tokens in a given field, so if the token for “security” shows up repeatedly, we reduce it to a single instance.
- Hash Truncation: We truncate the hashes so that collisions can happen, so an analysis cannot assume that a single token associates with a single word.
- Random Ordering: We shuffle the order of tokens, so knowing something about the document or expected words at specific positions does not directly leak data.
- Multiple Keys: We use per-index and per-field (and, if applicable, per-tenant) keys so, for example, the hash for “WordA” in the title field is different from the hash for “WordA” in another field.
- Stop Words: We optionally exclude indexing of very common words.
- Phonetic Normalization: We offer optional phonetic normalization for English so that common misspellings of words and names are considered the same.
- Shingles and N-gram Matches: For fields with shingles and/or n-gram matching enabled, many additional tokens are created, and they can be intermingled with the single word tokens, making frequency analysis much more difficult.
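Several of the measures in this list can be combined in a few lines. The sketch below is not the product's implementation; it assumes HMAC-SHA256, a simple HMAC-based key derivation, a 4-byte truncation length, and a tiny stop-word list, all chosen for illustration. It covers stop words, frequency suppression, hash truncation, random ordering, and multiple keys, and leaves out phonetic normalization, shingles, and n-grams.

```python
import hashlib
import hmac
import secrets

STOP_WORDS = {"the", "a", "of", "and"}  # illustrative list only


def field_key(master_key: bytes, tenant: str, index: str, field: str) -> bytes:
    """Derive a distinct key per tenant/index/field (scheme is an assumption)."""
    label = "\x00".join((tenant, index, field)).encode("utf-8")
    return hmac.new(master_key, label, hashlib.sha256).digest()


def blind_tokens(key: bytes, text: str, truncate_bytes: int = 4,
                 drop_stop_words: bool = True) -> list[str]:
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    # A set keeps one copy of each hash (frequency suppression); slicing the
    # digest shortens it so unrelated words can collide (hash truncation).
    hashes = {
        hmac.new(key, w.encode("utf-8"), hashlib.sha256).digest()[:truncate_bytes].hex()
        for w in words
    }
    tokens = list(hashes)
    secrets.SystemRandom().shuffle(tokens)  # random ordering
    return tokens


master = b"demo-master-key"
text = "The security of security systems"
title_tokens = blind_tokens(field_key(master, "tenant-1", "docs", "title"), text)
body_tokens = blind_tokens(field_key(master, "tenant-1", "docs", "body"), text)
```

Here the five input words yield two tokens per field, and the same text produces different tokens under the title and body keys. A shorter truncation length leaks less but returns more false positives at query time.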
Does that impact matches or result rankings?
Matches are generally the same, but rankings can change, especially since we remove or reduce the information on the number of times a word shows up in a document and on the proximity of different words. We provide configuration options to tune for security or relevancy as desired, but our default settings provide generally good results that are close to the results without encryption.
Can we only do exact matching?
Yes and no. The simple construction above only allows exact matching, but this construction can be extended to compute the hashes over substrings, prefixes, etc. This enables many more types of searches. The catch is that the desired options must be known at index time, because the tokens must be produced and stored. These techniques also increase the amount of data stored in the index for each document.
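Prefix matching is one such extension. In the minimal sketch below, every prefix of a word (from an assumed minimum length of 3) is hashed at index time, so a query for a prefix can be answered with an exact hash comparison. HMAC-SHA256, the key, and the helper names are assumptions made for the example.

```python
import hashlib
import hmac


def blind_token(key: bytes, text: str) -> str:
    """Keyed hash of a lowercased string (HMAC-SHA256 is an assumption here)."""
    return hmac.new(key, text.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def prefix_tokens(key: bytes, word: str, min_len: int = 3) -> set[str]:
    """Hash every prefix of the word from min_len up to the full word."""
    word = word.lower()
    return {blind_token(key, word[:n]) for n in range(min_len, len(word) + 1)}


key = b"demo-key-not-for-production"
indexed = prefix_tokens(key, "encryption")

# A prefix query is hashed the same way and compared exactly.
matches_prefix = blind_token(key, "encr") in indexed
# Suffixes were not produced at index time, so they cannot match.
matches_suffix = blind_token(key, "tion") in indexed
```

The single word "encryption" yields eight stored tokens instead of one, which shows both the storage cost and why the choice has to be made before indexing.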
Where is the actual text stored?
Since the values have been swapped out for hashes, and since the document presumably contains sensitive data, the full text is encrypted and stored as well. We use a random key and AES-256-GCM to encrypt each document. That key is wrapped with a tenant-specific secret key and stored along with the document. This allows us to fully recreate fields from documents that have matched a search query.
Can we change tokenization methods?
This can be done, but it requires that each document be decrypted and re-indexed, which can be time-consuming.