How Encrypted Search Works

The following is additional technical information about how the blind index search feature is implemented in the Data Control Platform SDKs.

Indexing Data before Encryption

Generating Index Terms

The IronCore SDKs perform a multi-step process to generate index terms from an input string:

  1. The string is first transliterated. This converts the string to a form that is more understandable by someone who might speak a different language. This particular transliteration converts Unicode strings into ASCII characters: accents and other modifiers are removed, and characters from other scripts are replaced with ASCII approximations. For example, the string “Æneid” is converted to “AEneid”, and “北亰” is converted to “Bei Jing”.
  2. The transliterated string is converted into lowercase letters, and punctuation characters are removed.
  3. This string is split into words on whitespace boundaries.
  4. The set of all possible trigrams is extracted from each word. A trigram is a string of three consecutive characters; for example, the word “gumby” generates the trigrams “gum”, “umb”, and “mby”. The word “of” generates the trigram “of-” (we use “-” as padding for shorter strings; this is safe because “-” is one of the punctuation characters that was stripped in step 2).
  5. These sets of trigrams are unioned together to form a complete set of terms that represents the input string.
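
Steps 2 through 5 above can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the SDK's implementation: the function name `index_terms` is hypothetical, the input is assumed to be already transliterated to ASCII (step 1 is omitted), "punctuation" is taken to mean anything that is neither alphanumeric nor whitespace, and one-character words are assumed to be padded with two “-” characters.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: steps 2-5 for input that is assumed to be
/// already transliterated to ASCII.
fn index_terms(transliterated: &str) -> BTreeSet<String> {
    // Step 2: lowercase, then drop punctuation (assumed here to be any
    // character that is neither alphanumeric nor whitespace).
    let cleaned: String = transliterated
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();

    let mut terms = BTreeSet::new();
    // Step 3: split into words on whitespace.
    for word in cleaned.split_whitespace() {
        let mut chars: Vec<char> = word.chars().collect();
        // Pad words shorter than three characters with '-'.
        while chars.len() < 3 {
            chars.push('-');
        }
        // Step 4: every window of three consecutive characters is a trigram.
        for window in chars.windows(3) {
            // Step 5: inserting into one shared set performs the union.
            terms.insert(window.iter().collect::<String>());
        }
    }
    terms
}

fn main() {
    println!("{:?}", index_terms("gumby")); // {"gum", "mby", "umb"}
    println!("{:?}", index_terms("J. Fred Muggs").len()); // 6
}
```

Using a set rather than a list means a trigram that occurs in several words is stored only once.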

Converting Index Terms to Index Tokens

Once an input string has been processed into a list of index terms, those terms are converted into index tokens. Each term is prefixed by the optional partition name and a salt value that serves as the hash key, and a SHA256 hash is computed over the resulting string. This generates a 32-byte binary value; we convert the first 4 bytes into an unsigned 32-bit integer that is the index token.
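
The truncation step can be sketched as follows. The function name `token_from_digest` is hypothetical, and big-endian byte order is an assumption of this sketch, since the text above does not state which byte order the SDK uses. Computing the SHA256 digest itself is left out to keep the example dependency-free; the digest is passed in as a 32-byte array.

```rust
/// Hypothetical helper: keep the first 4 bytes of a 32-byte SHA-256 digest
/// and interpret them as an unsigned 32-bit integer.
/// Big-endian order is an assumption of this sketch.
fn token_from_digest(digest: &[u8; 32]) -> u32 {
    u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]])
}

fn main() {
    // Placeholder digest. In practice this would be the SHA-256 output
    // computed over the partition name, the salt, and the index term.
    let mut digest = [0u8; 32];
    digest[..4].copy_from_slice(&[0x12, 0x34, 0x56, 0x78]);
    println!("{:#010x}", token_from_digest(&digest)); // 0x12345678
}
```

Only the first 4 of the 32 digest bytes survive, so distinct terms can map to the same token.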

Random Padding

Another piece of information that can be leaked if an attacker has access to the index entries is an approximate length of the input string. It isn’t precise, because the tokens are a set (so duplicates are ignored), and because we break the string at word boundaries. However, we do take measures to further hide the length of each piece of data indexed by adding a random number of random 32-bit integer values into the collection of index tokens for a string.
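
A minimal sketch of the padding idea is shown below. Everything in it is illustrative: `pad_tokens` and `next_random` are hypothetical names, the range of one to four extra values is an arbitrary choice for the demo, and the toy xorshift generator exists only to keep the example dependency-free. A real implementation would draw both the count and the values from a cryptographically secure random source.

```rust
use std::collections::HashSet;

/// Hypothetical helper: grow the token set by `extra` random values,
/// drawing each value from the caller-supplied `next_random`.
fn pad_tokens(tokens: &mut HashSet<u32>, extra: usize, mut next_random: impl FnMut() -> u32) {
    let target = tokens.len() + extra;
    // Loop until the set has actually grown by `extra`, because a random
    // value may collide with a token that is already present.
    while tokens.len() < target {
        tokens.insert(next_random());
    }
}

fn main() {
    // Toy xorshift generator, NOT suitable for real padding.
    let mut state: u32 = 0x9E37_79B9;
    let mut toy_rng = move || {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        state
    };

    let real: HashSet<u32> = vec![11, 22, 33].into_iter().collect();
    let mut padded = real.clone();
    // Assumed range for the demo: between 1 and 4 extra values.
    let extra = 1 + (toy_rng() % 4) as usize;
    pad_tokens(&mut padded, extra, &mut toy_rng);

    // The real tokens are still present, hidden among the random ones.
    println!("real: {}, padded: {}", real.len(), padded.len());
    println!("real tokens kept: {}", real.is_subset(&padded));
}
```

Because the real tokens remain in the set, padding changes the apparent length of the entry without affecting whether it matches a query.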

Partitioning

To understand the effects that randomization and the partition ID have on the tokens produced, consider the following code snippet:

Rust
let name = "J. Fred Muggs";
// Tokenize the same string twice with no partition ID.
let pii_tokens = pii_index.tokenize_data(&name, None)?;
println!("{:?}", pii_tokens);
let pii_tokens2 = pii_index.tokenize_data(&name, None)?;
println!("{:?}", pii_tokens2);
// Tokenize it again, this time with the partition ID "Part1".
let pii_tokens3 = pii_index.tokenize_data(&name, Some("Part1"))?;
println!("{:?}", pii_tokens3);

The string “J. Fred Muggs” produces six index tokens. Because of the random padding, each of the lists of tokens will contain at least seven tokens, and the lengths might all be different. All of the lists were generated with the same blind index, so they share the same salt value. The first two lists should have six token values in common, since they used the same partition (the one with no ID); it is possible but highly unlikely that a seventh value is also the same. The third list is unlikely to have any tokens in common with either of the other two lists.

Processing Search Queries

Generating Index Terms

A search query is processed to produce a set of index tokens in much the same way that the index tokens were generated when indexing data: transliteration, lowercasing, removing punctuation, splitting into words, generating trigrams, and hashing with the partition name and salt. This process does not add any random padding, however.

Once the SDK has generated the set of index tokens, it is the responsibility of the application to search the stored blind index for matching entries. An entry matches if its set of index tokens contains all of the tokens generated for the query string. The application should find all entries that match and return the encrypted data to the client for decryption and processing.
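
The matching rule is a subset test, which can be sketched as follows. The names `entry_matches` and `matching_ids` and the in-memory list of `(id, tokens)` pairs are assumptions made for illustration; a real application would more likely express the same test as a query against its own data store.

```rust
use std::collections::HashSet;

/// An entry matches when it holds every token generated for the query.
fn entry_matches(entry_tokens: &HashSet<u32>, query_tokens: &HashSet<u32>) -> bool {
    query_tokens.is_subset(entry_tokens)
}

/// Hypothetical in-memory index: return the IDs of all matching entries.
fn matching_ids(index: &[(u64, HashSet<u32>)], query_tokens: &HashSet<u32>) -> Vec<u64> {
    index
        .iter()
        .filter(|(_, tokens)| entry_matches(tokens, query_tokens))
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let index: Vec<(u64, HashSet<u32>)> = vec![
        // Entry 1 holds both query tokens plus unrelated (e.g. padding) values.
        (1, vec![10, 20, 30, 777].into_iter().collect()),
        // Entry 2 is missing token 20, so it does not match.
        (2, vec![10, 40].into_iter().collect()),
    ];
    let query: HashSet<u32> = vec![10, 20].into_iter().collect();
    println!("{:?}", matching_ids(&index, &query)); // [1]
}
```

Extra tokens in an entry never prevent a match, which is why random padding does not interfere with search.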

Elimination of False Positives

Because the index tokens are hashes of the index terms and because we truncated the hashes to 32 bits and added random padding, it is possible that the search could return some false positives - that is, strings that had matching index tokens but don’t actually contain the query that was entered. For this reason, the client needs to actually scan the list of matches, decrypt each entry, and eliminate any non-matching entries.

To facilitate this check, the IronCore SDK includes a method that accepts a string and generates the transliterated version. This should be applied to the query string, then applied to each of the decrypted data strings. The client can confirm which of these decrypted transliterated data strings actually contains all of the words in the transliterated query string as substrings.
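
The client-side check might look like the sketch below. It assumes both strings have already been passed through the SDK's transliteration method, which is not called here. The helpers `normalize` and `is_true_match` are hypothetical, and the additional lowercasing and punctuation stripping are assumptions chosen to mirror the indexing steps.

```rust
/// Hypothetical helper: lowercase and strip punctuation from text that is
/// assumed to be already transliterated.
fn normalize(transliterated: &str) -> String {
    transliterated
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect()
}

/// True when every word of the query occurs as a substring of the data.
fn is_true_match(data: &str, query: &str) -> bool {
    let haystack = normalize(data);
    normalize(query)
        .split_whitespace()
        .all(|word| haystack.contains(word))
}

fn main() {
    // Decrypted candidates returned by the blind index search.
    let candidates = ["J. Fred Muggs", "Alfred Mugford"];
    let query = "fred mugg";
    for data in candidates.iter() {
        println!("{} -> {}", data, is_true_match(data, query));
    }
}
```

In this example the second candidate is rejected because it does not contain “mugg”, even though it shares the trigrams “fre”, “red”, and “mug” with the query.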
