by gorbypark 747 days ago
I think OP is saying that the algorithm is designed in such a way that it can match “visually similar” photos/content. The idea is that you can’t evade a match just by rescaling, cropping or otherwise slightly changing a photo, as you could if this were just a regular SHA hash of the file. Now they are saying that it might be possible to create an “evil hash” that could match a large percentage of documents, because the hash algorithm obviously doesn’t have enough bits to actually represent the content. So if you have a hash of “white document with some black text” (for example, if looking for image scans of documents) and add this to the db of “watched” hashes, you could in theory hoover up documents.

A quick search didn’t lead me to any proofs of concept for this idea, but on the surface (I don’t have any knowledge of the hashing algorithm used in these content filters) it seems plausible, depending on a lot of factors.
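
To make the idea a bit more concrete, here is a toy sketch in Python. It is not the algorithm used in any real content filter (I don't know what those use); it is a simple "average hash" over tiny 4x4 grayscale grids, and the names `average_hash`, `matches` and the threshold of 4 bits are all assumptions made up for the illustration. It shows two things: a slightly brightened copy keeps the same perceptual hash while its SHA-256 changes, and a watched hash of a blank white page combined with a loose match threshold catches two different "white page with some black text" documents but not a photo.

```python
import hashlib

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, 1 if the pixel is >= the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p >= mean else 0 for p in flat)

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def sha(pixels):
    """Ordinary cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()

def matches(watched, candidate, threshold=4):
    """Hypothetical matching rule: 'similar' if at most `threshold` bits differ."""
    return hamming(watched, candidate) <= threshold

# Two different scanned "documents": white pages with dark text in different places.
page_a = [[250, 250, 250, 250],
          [ 20,  30,  20, 250],
          [250, 250, 250, 250],
          [250, 250, 250, 250]]
page_c = [[250, 250, 250, 250],
          [250, 250, 250, 250],
          [ 30,  20, 250, 250],
          [250, 250, 250, 250]]

# page_b is page_a with every pixel darkened slightly (a trivial edit).
page_b = [[p - 5 for p in row] for row in page_a]

# A non-document image: left half dark, right half bright.
photo = [[10, 10, 240, 240] for _ in range(4)]

# The "evil" watched hash: a blank white page.
blank = [[250] * 4 for _ in range(4)]
watched = average_hash(blank)

print(average_hash(page_a) == average_hash(page_b))  # perceptual hash survives the edit
print(sha(page_a) == sha(page_b))                    # SHA-256 does not
for name, img in [("page_a", page_a), ("page_c", page_c), ("photo", photo)]:
    print(name, hamming(watched, average_hash(img)), matches(watched, average_hash(img)))
```

With these made-up grids, the blank-page hash is 3 and 2 bits away from the two documents and 8 bits away from the photo, so a threshold of 4 out of 16 bits flags both documents. Whether anything like this carries over to a real system depends on its hash length and matching threshold, which I don't know.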

I'm not sure that this would work, mainly because it would mean that all the documents would have the same hash, rendering the content ID system useless. I don't know how many bits a content ID contains, but I imagine it's enough to avoid having too many collisions (as that would reduce the usefulness of the system).

Right. That is the OP's point, however (as far as I can tell). They think it will be intentionally abused by governments as a way of collecting data, by mandating certain hashes that will intentionally have many collisions.