Hacker News new | ask | show | jobs
by lsh123 686 days ago
If we assume that server is “evil” then the server can store both PIR encrypted and plain text phone number in the same row in the database and when this row is read, simply log plain text phone number. What do I miss here? We can send PIR request and trust server not to do the above; or we can send plain text phone number and trust server not to log it — what’s the difference?
3 comments

A very simple PIR scheme on top of homomorphic encryption that supports multiplying with a plaintext and homomorphic addition, would look like this:

The client one-hot-encodes the query: Enc(0), Enc(1), Enc(0). The server has 3 values: x, y, z. Now the server computes: Enc(0) * x + Enc(1) * y + Enc(0) * z == Enc(y). Client can decrypt Enc(y) and get the value y. Server received three ciphertexts, but does not know which one of them was encryption of zero or one, because the multiplications and additions that the server did, never leak the underlying value.

This gives some intuition on how PIR works, actual schemes are more efficient.

[Disclosure: I work on the team responsible for the feature]

Does the server reads specific rows from spam numbers DB or the whole database?
In this PIR model the server has to read the whole database, otherwise it would be easy on the server to see, that these rows were not accessed and therefore they are not the one the client queried.

In this PIR model the server runtime is O(n) where n is the number of rows.

To keep it practical, we do support sharding the database. Client leaks a few bits of hashed query to pick the right shard, where we process the entire shard. There is a inherent privacy-performance tradeoff: less shards = less leakage vs more shards = better performance & less privacy.

Thanks for explanation. I will read the code to see how sharing works.
https://github.com/apple/swift-homomorphic-encryption/blob/3...

Run SHA256 on the keyword, truncate the hash and take the modulus with number of shards.

The server never gets the plaintext at all. It only ever receives encrypted data that it cannot read.
I think OP is talking about the set of “spam phone numbers” stored on the server and looking at side channels based on what data is looked up by processing the query.
Exactly, plain text phone number in the same db row
Not sure if its implemented exactly like this but one could imagine that the server must use the client's public key to encrypt the database into a blob it can't decrypt. The encrypted input is used to read the encrypted blob in a way the server can't understand to construct a result the server cannot understand. The result blob is sent to the client which can decrypt it.
Yeah but as I wrote elsewhere, the DB isn’t a KV store of plain text numbers and their encrypted representation. Instead the entire database would be encrypted and you’d do set containment operations in encrypted space which wouldn’t /couldn’t leak anything about your query (modulo unexpected side channels in the design).

I don’t know how they do this efficiently and at scale with lots of updates, but maybe this database is kinda small to begin with anyway and the updates are reasonably cheap to process relative to how many spam numbers are out there.

That could only work if the server didn't have its own database copy. Not sure how a client would be able to provide the server with a database encrypted by the client.

If the server can decrypt it, it's not really safe if you're assuming server is evil

The database isn’t secret here. The server indeed has its own copy - it would have to otherwise what is the client query resolving against. What’s secret is which phone numbers are contacting the client. So instead of sending the phone number to the server, you send an encrypted version of the phone numbers. This encrypted version is then checked against the encrypted database. This prevents the evil server from discovering the phone number the client is checking.

If you read the docs, a perfectly valid implementation is an HTTP request that sends the unencrypted database to the client which then checks the numbers locally - it achieves equivalent security priorities. The advantage here is that the database can be large enough to make distribution less practical than just doing a lookup per number and that’s where the HE comes in.

Remember: evil in a security context means someone trying to actively circumvent your protection guarantees, but you’re making an assumption that the database needs to be secret when it may not as the privacy and security guarantees are about the client’s information. Apple isn’t necessarily saying the database is secret since it’s just “this phone number is likely spam”. Of course, it’s possible that the server itself can’t even generate a valid query. It’s possible Apple designed it such that the query has to be generated on a valid Apple device to begin with (since it has a chain of trust to each device manufactured).

The client don't have to do that.

That's the whole point of Homomorphic Encryption. There is a Wikipedia article for that.

https://en.wikipedia.org/wiki/Private_information_retrieval

That’s not what I saw in the code but I didn’t spend much time so I might be wrong. I’ll check it more carefully later. But if this indeed is whole DB then it’s very limited use case.
It’s a lot more complicated because the phone numbers themselves are stored encrypted and there’s not a 1:1 mapping between encrypted representation and the mapping. So processing the query is actually blinding the evil server afaik.
Evil server stores BOTH encrypted and plain text phone number in the same db row
There’s no single encrypted representation of a phone number. Rather the entire database is encrypted and the accesses performed by the HE algorithm would be randomly accessing the database in a way that wouldn’t leak anything. Now of course if you have billions of lookups a day maybe something does end up leaking because you’d be able to extract networks from data patterns (ie if the same number is contacting 100 numbers, the data access patterns might be the same) but it’s a lot more complicated than what you’re proposing and I wouldn’t be surprised if this is explicitly handled in the design of the feature.