Hacker News
by siilats 747 days ago
Scanning all your messages would be the easiest thing in the world for intelligence agencies. They just need to submit a few million fake “content ID” hashes, and your phone will automatically share the images that match. Nobody can tell whether a content ID hash is of a photo of a document or a photo of a person; it’s just a 256-byte hash. This is so easily abused. I bet that, the way it’s implemented, it doesn’t have enough resolution to read text, so one evil content ID hash will match any photo of any document or screenshot you have taken. So essentially your WhatsApp client will send every screenshot of a text document to the NSA.

A few million fake "content id" hashes still give them only a 1/100000000000000000000000000000000000000000000000000000000000000000000000000000 chance that any one of those hashes will match. 256 bits is a lot. Besides, what are they going to do? Sift through random vacation photos that happened to hit this "ten jackpots in a row" chance?
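
For a rough sense of scale, here is a back-of-the-envelope sketch of that estimate. It assumes an ideal, uniformly distributed 256-bit hash (true of a cryptographic hash, not necessarily of a perceptual one) and takes "a few million" to mean 3 million, which is an assumed figure, not one from the thread.

```python
from fractions import Fraction

HASH_BITS = 256             # hash size mentioned in the comment above
WATCHLIST_SIZE = 3_000_000  # assumption: stand-in for "a few million" fake hashes


def match_probability(watchlist_size: int, hash_bits: int) -> Fraction:
    """Chance that one unrelated image's hash equals any watchlist entry,
    assuming hashes are uniformly distributed over 2**hash_bits values."""
    return Fraction(watchlist_size, 2 ** hash_bits)


p = match_probability(WATCHLIST_SIZE, HASH_BITS)
print(f"{float(p):.3e}")  # on the order of 1e-71
```

Under those assumptions an accidental match is effectively impossible, but the estimate only holds if the hash behaves like a random function.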
I think OP is saying that the algorithm is designed in such a way that it can match “visually similar” photos/content. The idea is that you can’t evade a match just by rescaling, cropping, or otherwise slightly changing a photo, as you could if this were just a regular SHA hash of the file. Now they are saying that it might be possible to create an “evil hash” of a document that could match a large percentage of documents, because the hash algorithm obviously doesn’t have enough bits to actually represent the content. So if you have a hash of “white document with some black text” (for example, if looking for image scans of documents) and add this to the db of “watched” hashes, you could in theory hoover up documents.

A quick search didn’t lead me to any proofs of concept for this idea, but on the surface (I don’t have any knowledge of the hashing algorithm used in these content filters) it seems plausible, depending on a lot of factors.
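
To make the distinction concrete, here is a toy sketch. Nobody in this thread knows the actual content ID algorithm, so this uses a simple "average hash" as a stand-in, and the 64x64 synthetic pages and the `make_page` and `average_hash` helpers are made up for illustration. It shows a slightly brightened copy keeping the same perceptual hash while its SHA-256 changes, and two pages with different "text" collapsing to the same 64-bit hash.

```python
import hashlib

SIZE = 64    # toy page is SIZE x SIZE grayscale pixels (assumption)
GRID = 8     # hash has GRID x GRID = 64 bits
MARGIN = 8   # white border around the text area


def make_page(ink):
    """White page with 2-pixel-tall text lines; ink(x, y) picks the dark pixels."""
    page = [[255] * SIZE for _ in range(SIZE)]
    for y in range(MARGIN, SIZE - MARGIN):
        if y % 4 < 2:  # text line rows, separated by 2-pixel gaps
            for x in range(MARGIN, SIZE - MARGIN):
                if ink(x, y):
                    page[y][x] = 0
    return page


def average_hash(page):
    """One bit per block: set if the block is brighter than the mean of all blocks."""
    step = SIZE // GRID
    blocks = []
    for by in range(GRID):
        for bx in range(GRID):
            total = sum(page[y][x]
                        for y in range(by * step, (by + 1) * step)
                        for x in range(bx * step, (bx + 1) * step))
            blocks.append(total / (step * step))
    mean = sum(blocks) / len(blocks)
    return sum(1 << i for i, b in enumerate(blocks) if b > mean)


def sha256_hex(page):
    return hashlib.sha256(bytes(p for row in page for p in row)).hexdigest()


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


page_a = make_page(lambda x, y: (x + y) % 2 == 0)  # one "text" pattern
page_b = make_page(lambda x, y: x % 3 == 0)        # a different "text" pattern
brighter_a = [[min(255, p + 10) for p in row] for row in page_a]

print(sha256_hex(page_a) == sha256_hex(brighter_a))              # False
print(hamming(average_hash(page_a), average_hash(brighter_a)))   # 0
print(hamming(average_hash(page_a), average_hash(page_b)))       # 0
```

In this toy setup the hash of `page_a` behaves like the "white document with some black text" hash described above: it matches a different page with the same layout. Whether a real content filter is this coarse is exactly the open question.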

I'm not sure that this would work, mainly because it would mean that all the documents would have the same hash, rendering the content ID system useless. I don't know how many bits a content ID contains, but I imagine it's enough to avoid having too many collisions (as that would reduce the usefulness of the system).
Right. That is the OP's point, however (as far as I can tell). They think it will be intentionally abused by governments as a way of collecting data, by mandating certain hashes that are designed to have many collisions.