Hacker News new | ask | show | jobs
by baystep 3211 days ago
As someone who is going to implement "find-my-friends" in this way, and wants to do it conscientiously, may I ask why only the first 10 bytes of the SHA1? Isn't that just an incomplete hash that could collide with other valid numbers?
5 comments

That's the point. You hash the number, and keep the hash. Send the first few digits of the hash to the server, it sends you a bunch of full hashes back. If your hash is in that list, then you know the server knows that person.
But that sounds like un-necessary round trips. I feel like the point in an API or backend service doing this, is because it knows better and has more computing power. I'd rather have my service take an incoming list and return with whether it's a member or not. In fact returning full hashes seems even worse, since now the server has not only replied with the correct match, but potentially more strangers. You could milk an API of it's users much quicker that way.
Yes, but the server doesn't know which of the responses is actually your friend, thus preserving privacy.
Right, but I'm thinking the server still needs to be able to say "This person who ever they are is definitely the number you have". Like, I wouldn't check my password hashes with only the first 10 bytes, I guess that's why I'm still stuck on the truncated idea.
No, the server just says "I know these email addresses starting with 'foo': foo@example.com, foo2@gmail.com, foo3@hotmail.com". If your friend's email address is in there, you know they have Signal.

The hashing is so people can't easily get a list of all the phone numbers, which is easy to work around, but, then again, they could just hammer the endpoint querying for all the various numbers anyway.

Signal way: Server keeps tokens of registered users cached. Client makes tokens of all the numbers it has, remembering the token -> number map for them. Client sends tokens to server, server returns the subset that matched. Client knows which numbers match. Making the token 10 bytes rather than 20 just saves bandwidth.

Silent Circle way: Server keeps hashes of registered users cached. Client hashes all the numbers it has, remembering the hash -> number map for them. Client sends a small number of the most significant bits of its hashes. Server treats this like a mask or wildcard search, returns all matches. Client knows which match, also gets hashes for other users they don't know. Sure, the client can reverse them, but the client could have probed for them anyway. The server maybe/probably doesn't learn enough to preimage what the client sent because of too many collisions. The downside is that the number of bits the client sends needs to be appropriate given the size of the database, though that can be mitigated by sharding by country code/area code/whatever.

I've never heard of a way to do this without giving all your contacts to the app. Hashing is pretty useless if an attacker steals the DB as the parent pointed out because you can pretty easily build a dictionary of all possible 10-digit US phone number hashes (especially since area codes aren't anywhere close to randomly distributed). One slight advantage of hashing is non-malicious but curious employees/engineers at the company could be tempted to look at contact info whereas reversing hashes is much more clearly over the line and not everyone will know how to do it, but that's more a matter of training employees not to do it either way.
It doesn't matter if the attacker steals the database of hashes because the database of registered users is also stored there in plaintext. The attacker/insiders would need to start logging the contact intersection requests, including the authenticated user making them, in order to see a social graph and how that changes over time. Cracking the hashes is only necessary to unmask the phone numbers.

Standard OTT messaging architecture guarantees the service will see message envelopes anyway, so it's not worth the trade off of deploying PIR schemes of the differential or computational variety. Look for stuff by Ian Goldberg. Percy++ is a practical example you can run yourself.

For OTT contact sync the reasonable thing to do is just send the phone numbers. Blinding them by truncated hash is a nice gesture. What's not cool is sending all the other fields of the address book along with it.

We know and can easily verify that Signal is being good. But what can we do about less trustworthy services? The phone would need to apply permissions. Like, allow/deny filters of which fields of contacts or contact categories each app may have access to. An address book firewall, essentially. Considering how important messaging apps are in our lives and the amount of time we spend with them I think such granularity is warranted.

You're missing the point. "Hashing" isn't the interesting word in "truncated hashing". "Truncated" is.
I think it's much less likely to have collisions because the input space is so restricted (just valid phone numbers).
SHA-1 has approximately Random distribution, so it doesn't matter what the input is really in practice.
The point of the person you are responding to essentially comes down to pidgeon hole principle (the best principle), and absolutely applies to a cryptographically random function, and is agnostic to the input distribution: it is a relative measure of information/entropy between two sets.
I agree with what you said, but I don't agree that's what the person implied by their post.
Git defaults to using 7 hex characters for short hashs and doesn't see collisions for 95+% of repos. Even the Linux kernel repo with all its objects only needs 12 characters to ensure uniqueness.
Yes, but if (it seems more like when) a collision happens in a situation like this, aren't we essentially saying someone is connected with someone they don't even know? I feel like detecting the collision takes more computation then just sending the whole hash right? I mean in consideration of the scale that Signal/WhatsApp/etc are with millions of phone numbers
With a couple GPUs you could find out if there is a collision on phone numbers with truncated hashes pretty quickly. If the phone numbers are normalized, except for a few edge cases, the space is twelve digits which can be naively brute forced. A trillion SHA-1 hashes isn't that hard to do these days.