
Um, the whole thing definitely is unaudited.

> the attack in this article doesn't work because the attacker doesn't have access to all the hashing algorithms used

As far as I know, there are only two hashing algorithms used: ContentID and "the Facebook one", whose name I don't remember offhand. ContentID has leaked, been reverse engineered, been published, and been broken. A method to generate collisions in it has been published. The Facebook one has never been secret, and essentially the same method can generate collisions in it. And most users just use ContentID. [On edit: Oh, yeah, Apple has their "NeuralHash" thing.]

Are there others? If there are, I predict they're going to leak just like ContentID did, if they haven't already. You can't keep something like that secret. To actually use such an algorithm, you end up having to distribute code for it to too many places. [On the same edit: that applies to NeuralHash too.]

I assume you're right that the false positive rate is very low at the moment. Given the way they're done, I don't see how those hashes would match closely by accident. But the whole point of this discussion is that people have now figured out how to raise the false positive rate at will. It's a matter of when, not if, somebody finds a reason to drive the false positive rate way up. Even if that reason is pure lulz.
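To put a rough number on "by accident", here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration: it treats a hash as a string of independent, uniformly random bits, calls two hashes a match when they differ in at most some number of bits, and uses a made-up 256-bit length and 31-bit threshold. Real images aren't random, so the real accidental rate is higher than this, and none of it says anything about a deliberately crafted collision, which is the actual problem.

```python
from math import comb


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes (as integers) differ."""
    return bin(a ^ b).count("1")


def is_match(a: int, b: int, max_distance: int) -> bool:
    """Hypothetical matching rule: hashes 'match' if they are close enough."""
    return hamming_distance(a, b) <= max_distance


def accidental_match_probability(bits: int, max_distance: int) -> float:
    """Chance that two independent, uniformly random `bits`-bit hashes
    land within `max_distance` bits of each other.

    Assumes idealized random hashes; this is a lower-bound intuition,
    not a measured false positive rate for any real system.
    """
    close_outcomes = sum(comb(bits, k) for k in range(max_distance + 1))
    return close_outcomes / 2 ** bits


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not taken from any deployed system).
    print(accidental_match_probability(256, 31))
```

Under those assumptions the number that comes out is vanishingly small, which is why accidental matches aren't the worry. An attacker who can compute the hash doesn't rely on chance at all; they just nudge an image until `is_match` returns true.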

None of this "texts the police", but it does alert service providers who may delete files, lock accounts, or flag people for further surveillance and heightened suspicion. Much of that is entirely automated. And a lot of the other things you'd use as input, if you were writing a program to decide how suspicious you were, are even more prone to manipulation and false positives.

I believe the service providers also send many of the hits to various "national clearinghouses", which are supposed to validate them. Those clearinghouses usually aren't the police, but they're close.

But the clearinghouses and the police aren't the main problem the false positives will cause. The main problem is the number of people and topics that disappear from the Internet because of risk scores that end up "too high".