How It Works
Cloaked AI implements a form of property preserving encryption to encrypt vector embeddings in such a way that they can still be used for the common use cases of nearest neighbor search and clustering based on their similarity to other vectors. The property that is preserved by the encryption is the distance between vectors. Like other forms of property-preserving encryption, a tradeoff is made between the most secure protection of the data and the usability of the data in its encrypted form.
The encryption technique used in Cloaked AI is based on a paper Approximate Distance-Comparison-Preserving Symmetric Encryption by Georg Fuchsbauer, Riddhi Ghosal, Nathan Hauke, & Adam O’Neill, first released on eprint, and later published in the proceedings of the 2022 Security and Cryptography for Networks conference.
Encrypting Vector Elements
The core idea of the technique is to scale the elements of the vector by a secret factor, then to perturb the elements of the scaled vector by adding a random vector to it. This perturbation is pseudorandom, based on a secret key. The idea is that the amount of perturbation of each individual element is uniformly distributed within a range defined by a value that we refer to as the approximation factor. Conceptually, for a vector with n elements, the encryption can be viewed as choosing a new point that lies within an n-dimensional sphere centered on the point represented by the vector whose radius is determined by the scaling factor and approximation factor. The choice of the point is pseudorandom rather than fully random so that the encrypted vector can be decrypted given the same key.
The values of the scaling factor and the key used for the pseudorandom function are the same for all vectors in the same data segment (such as all the vectors for one tenant in a multi-tenant SaaS system), but different values are used for each segment, to make it more difficult to extract information from the encrypted vector data by analyzing clusters or other statistical measures.
The larger the approximation factor, the more difficult it is to guess what the original point was, but the less accurate the results of nearest neighbor searches are. The choice of approximation factor must balance the need for security with the need for accuracy. The factor is a configuration parameter for the SDK - the value chosen should reflect the range of values that each element of the vector can have; we currently assume that the distribution of values is the same for each element of the vector, which we have observed to be the case for several different embedding models used in AI. If you are using a model and would like some guidance on the choice of approximation factor, please contact us via email at firstname.lastname@example.org_ and let us know which model you are using. We’ll help you evaluate the range of values that will be reasonable for the approximation factor, depending on your relative priority of security and search accuracy.
Shuffling Vector Elements
After the vector has been encrypted, we reorder the elements of the vector by applying a deterministic shuffling algorithm that is based on a secret value. Like the choice of scaling factor and secret used to encrypt the vector elemetns, the same shuffling secret is used for all the vectors in the same data segment, but a different secret is used for each segment.
Encrypting Accompanying Data
There is often additional data (metadata) that is associated with a vector embedding, represented as one or more additional fields. These metadata fields might also be sensitive and need to be protected. The toolkit that includes the vector encryption routines also includes methods for encrypting the metadata. If a field might be used in searches to identify vectors, it should be encrypted with determinisitic encryption, another form of property-preserving encryption that allows exact match searches of the encrypted fields. If searches using the field are not required, it can be encrypted with standard encryption.
Was this page helpful?