Hacker News new | ask | show | jobs
by teo_zero 339 days ago
I think the opening example involving Google is misleading. When I hear "Google" I think "search the web".

The articles is about getting an input encrypted with key k, processing it without decrypting it, and sending back an output that is encrypted with key k, too. Now it looks to me that the whole input must be encrypted with key k. But in the search example, the inputs include a query (which could be encrypted with key k) and a multi-terabyte database of pre-digested information that's Google's whole selling point, and there's no way this database could be encrypted with key k.

In other words this technique can be used when you have the complete control of all the inputs, and are renting the compute power from a remote host.

Not saying it's not interesting, but the reference to Google can be misunderstood.

3 comments

> Now it looks to me that the whole input must be encrypted with key k. But in the search example, the inputs include a query […] and a multi-terabyte database […]

That’s not the understanding I got from Apple’s CallerID example[0][1]. They don’t seem to be making an encrypted copy of their entire database for each user.

[0]: https://machinelearning.apple.com/research/homomorphic-encry...

[1]: https://machinelearning.apple.com/research/wally-search

They do not explicitly state this fact, but they link to the homomorphic encryption scheme they're using, which works like this. To perform an operation between a plaintext value and an encrypted value, you first encrypt the plaintext with the public key and then you can do your operation on the encrypted values to get the encrypted output.

Moreover, even if the details were slightly different, a scheme that reveals absolutely no information about the query while interacting with a database always needs to do a full scan. If some parts remain unread depending on the query, this tells you what the query wasn't. If you're okay with revealing some information, you can also hash the query and take a short prefix of the hash with many colliders, then only scan values with the same hash prefix. This is how browsers typically do safe browsing lookups, but by downloading that subset of the database instead of doing the comparison homomorphically on the server.

Is that what they mean in the Wally paper post by

> In previous private search systems, for each client query, the server must perform at least one expensive cryptographic operation per database entry.

?

Exactly. (I had only looked at the homomorphic encryption post, not the Wally post.) Wally tries to work around this limitation by only using homomorphic encryption for a subset of the database, and reducing the resulting information leakage by using an anonymous network to hide which client is querying which subset. They say this network is operated by a third party, but ultimately you still have to trust that the network operator isn't colluding with the server operator to deanonymize your queries. That's a weaker privacy guarantee, but at least it's not painfully slow.
Homomorphically encrypted services don't need a priori knowledge of the encryption key. That's literally the whole point.

Consider the following (very weak) encryption scheme:

m, k ∈ Z[p], E(m) = m * k mod p, D(c) = c * k⁻¹ mod p

With this, I can implement a service that receives two cyphertexts and computes their encrypted sum, without knowledge of the key k:

E(x) + E(y) = x * k + y * k mod p = (x + y) * k mod p = E(x + y)

Of course, such a service is not too interesting, but if you could devise an algebraic structure that supported sufficiently complex operations on cyphertexts (and with a stronger encryption), then by composing these operations one could implement arbitrarily complex computations.

In the case of searching Google, E(x) is the encrypted query and y is Google's database. Can you compute E(x + y) without doing at least as much work as computing E(y)? I don't think so. Instead, you use public key cryptography so that the server can compute E(y) (yes, encrypting the entire database) without being able to decrypt D(E(x)) = x.
Wrong. In the case of searching, the database is the function that you feed the input query into.

E.g. consider the following system:

E(x) = x ^ k, D(x) = x ^ k

So a one-time pad. Let's say that I provide a service that lets you decide whether a number is even or odd:

IsOdd(E(x)) = E(x) mod 2

You give it an encrypted number, and it gives you back an encrypted bit that you can decrypt to see if the original number was even or not. All while it has zero knowledge of your plaintext number.

A homomorphically encrypted database is just like IsOdd(x), except orders of magnitude more complex. One idea is that any computation can be turned into a Boolean circuit, so if you have homomorphic building blocks for Boolean circuits, you can implement any computation. Obviously, some caveats apply, like all loops have to be unrolled, etc. That's why the whole thing is so inefficient. But mathematically it works.

If I'm generous, that "database" stores a single number. If you transform a more realistic database (one that can store many arbitrary values) into a Boolean circuit, you'll generally end up with at least one operation per stored value. And when you evaluate the circuit on encrypted data, you have to evaluate all those operations every time. That's why I wrote "without doing at least as much work as computing E(y)" instead of just "without computing E(y)." Yes, you do not necessarily explicitly encrypt all stored values. But you'll end up performing a corresponding amount of computation anyways.
> IsOdd(E(x)) = E(x) mod 2

I don't get this. Did you mean

IsOdd(E(x)) = x mod 2

But who provides such function? In the case of a web search,the function is the search engine. I expect Google to provide it, not the user: the refinement and completeness level of its engine is the only reason to choose Google over its competitor. If I have to provide the function, then I'm only buying the compute power from them.

>I don't get this. Did you mean

>IsOdd(E(x)) = x mod 2

No, that's literally the point. IsOdd() operates on the cyphertext E(x). It doesn't see the plaintext x. And yet, due to its algebraic properties, the server can answer the query without decrypting it.

For example:

  Client:
    x = 13
    k = 45
    E(x) = 13 xor 45 = 32

  Server:
    IsOdd(E(x)) = 32 mod 2 = 0

  Client:
    D(IsOdd(E(x))) = D(0) = 0 xor TrimLength(45, 1) = 1 (we trim the decryption key to the length of the response, which is 1bit)
So what happens in this case is the server sends you a bit, which you'll either flip or not flip, depending on the key k. The server doesn't know whether you're going to flip its answer or not, so it doesn't know if your plaintext number is odd or even.
Now I see it. Thank you.

What I still don't understand is how this can be used to have the remote host give me responses that I couldn't have calculated myself locally, because all the data is stored in the cloud, not locally (that's how Google search works).

It's an excellent way of upholding Wirth's Law: software will continue to get slower more rapidly than hardware becomes faster.
I don't know, when I hear Google I hear Gmail, Google docs, and every other service they have to know about people. My mom would probably think mostly about search, but then she would not read an article about HME