Hacker News
by mm983 1774 days ago
This is painful to read with the arrogant undertone while you don't actually know how CSAM scanning works.

It's fuzzy hash, not "AI", based. Cloudflare uses it too, and last time I checked the web is still functional.

https://support.cloudflare.com/hc/en-us/articles/36004610611...

How about we turn the tables and have the complainers suggest a solution? Because every single time an approach to targeted child porn takedowns has been suggested, such as datacenter raids, which would not affect as many people, someone is screaming about their privacy.

Child abusers evolve and are very happy if law enforcement doesn't.

8 comments

>"have the complainers suggest a solution"

Being worried about privacy, establishing a precedent for scanning my data against a government database, and the risk of false positives with such an insanely emotionally charged crime is more than mere complaining.

The onus should not be on me to justify why this shouldn't be done. This is something new and it is perfectly fine to argue against it without needing to provide an alternative.

That being said, my solution is to continue to follow the process that law enforcement is currently using.

>That being said, my solution is to continue to follow the process that law enforcement is currently using.

If you knew how they approach this, you wouldn't be satisfied either.

This all boils down to the classic "Liberty vs. Security" dilemma. While I obviously want to see more prosecutions for people who commit these crimes, a value judgement must be made about what it takes to achieve that.
> Child abusers evolve and are very happy if law enforcement doesn't.

Yes. And now they will evolve by developing a simple system to modify pixels in images when they copy and transmit that will easily defeat this hashing system. The only effect this will have is that moral panickers like you will have got everybody's privacy invaded over your moral panic of the day.

> And now they will evolve by developing a simple system to modify pixels in images when they copy and transmit that will easily defeat this hashing system

They don't use simple file hashes to match images, but perceptual hashes. That way they can find modified derivatives of a source image. The problem with this approach, though, is that this is ripe for false positives. Two completely unrelated images can have similar hashes.
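
To make the distinction concrete, here is a minimal sketch in Python. It assumes a toy "average hash" over an 8x8 grayscale grid standing in for a downscaled image; this is not PhotoDNA or whatever any vendor actually runs, and the names `average_hash` and `file_hash` are made up for illustration. It shows a one-pixel tweak changing the file hash but not the perceptual hash, and an unrelated image colliding with the original.

```python
import hashlib

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def file_hash(pixels):
    """Ordinary cryptographic hash over the raw pixel bytes."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()

# Hypothetical 8x8 grayscale image: a smooth gradient from dark to light.
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]

# Derivative: the same image with a single pixel nudged by one brightness level.
modified = [row[:] for row in original]
modified[0][0] += 1

# Unrelated image: dark top half, bright bottom half, no gradient at all.
unrelated = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]

print(file_hash(original) == file_hash(modified))        # False: file hashes differ
print(average_hash(original) == average_hash(modified))  # True: perceptual hashes agree
print(average_hash(original) == average_hash(unrelated)) # True: a false positive
```

The last line is the false positive problem in miniature: the toy hash only records which pixels are brighter than average, so any image with the same bright/dark layout lands on the same value.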

Could you use multiple perceptual hash functions with different salts, so that collisions would be less likely while still allowing derivatives to be detected?
That reduces to just inventing a “fancier” single hash function. This adds no value or security in cryptography; it just makes things slower. I expect the same is true of perceptual hashes.
They aren't just matching exact hash hits, but are using a metric like the Hamming distance between hashes to determine if one image is the same as, or a derivative of, another. The data structures that allow for efficient lookups rely on that metric, or another metric, for matching.
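
As a rough sketch of what matching on a distance metric looks like, assuming 64-bit integer hashes and a made-up threshold (the real hash width, threshold, and index structure are not public as far as I know; `find_matches` and the sample values are hypothetical):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

def find_matches(query: int, known_hashes: list[int], max_distance: int) -> list[int]:
    """Return every known hash within max_distance bits of the query."""
    return [h for h in known_hashes if hamming_distance(query, h) <= max_distance]

# Hypothetical database of two known 64-bit hashes.
known = [0xFFFF0000FFFF0000, 0x0123456789ABCDEF]

# A query that differs from the first entry in its two lowest bits.
query = 0xFFFF0000FFFF0003

print(hamming_distance(query, known[0]))
print([hex(h) for h in find_matches(query, known, max_distance=4)])
```

This is a linear scan, which is fine for a toy list. For a large database you would want an index built around the metric (a BK-tree is one common choice for Hamming distance), which is why the lookup structure and the metric are tied together.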
Again, you don't appear to know how this works. Look up fuzzy hash; I even mentioned it in the comment.
I was a kid in the 80s and a teenager in the 90s. My favorite thing during that time was pirating video games. A game would come out, and it was cracked, usually within hours and often before the game was even released to be sold. That's when "zero day" had a different meaning. All the "warez" FTP sites had a section for 0 day warez. The cryptologists and math brains would come up with new protection methods to protect their IP from being copied. They spent millions, probably billions, on all of these projects. Yet some kid in their basement with a Commodore 64 was always able to crack them. Sometimes it would take longer. There were a few that took years, but once figured out, they unlocked hundreds of titles previously secured.

This is, and always has been, a game of cat and mouse. Law enforcement is always catching up. They are the cryptologists here. They are never ahead, always behind, because they don't know the new protections peddlers are using until they have been in use and later discovered.

No matter what vector you plug, they will use another, and the game continues (sick game). Maybe divide the image into 32 different tiles and rearrange them, then put them back in the correct order when viewing through a specific image viewer. I'm sure that would bypass whatever detections they've come up with in their fuzzy fingerprinting, as the entire image is now different. By the time they catch someone using this, they'll have already moved on to something different, as they always do.

I will never be ok with warrantless searches of my personal property, no matter the reason or justification or subject, and no matter who it is done by (government or private company). And I say that as a survivor of some pretty horrific shit as a kid to the point I fucking tremble with absolute rage when thinking about it 35+ years later. I would be banned from everything for life if I were to honestly state what I would do with these types of people. The movie "Saw" is tame in comparison. I have no compassion or sympathy for these sickos. But when reading world history, I can absolutely see the importance of "innocent until proven guilty" and Blackstone's Ratio "It is better that ten guilty persons escape than that one innocent suffer." Most of human history was the opposite, and it was brutal and full of literal witch hunts. Are we progressing as a species, or regressing in terms of human rights when it comes to technology?

You're spot on about the 0-day comparison. As always, something will arise to let people with the motivation hide illegal images. The problem is how it will be used against everyday citizens, who don't have sophisticated tools and maybe just share images of Hong Kong freedom protests, or books, or anything anti-totalitarian. The emotional appeals about this being about child abuse are absurd on their face because of how easily those people will hide. It's a good thing that some people are able to see through that as a ploy. We shouldn't have to go there and prove our bona fide hatred of abusers every time we justify our right to secure encryption or freedom from surveillance. Doing so almost validates the government's position. Just like in China you would have to say "of COURSE I hate the democracy protesters! I just think..." No, you shouldn't have to take a deep emotional dive into a history of abuse to justify your human right to privacy.
Just saying "fuzzy hash" doesn't begin to explain how this would work. There is an endless variety of algorithms, each with arbitrarily configurable tolerances. "Fuzzy hash" just isn't helpful no matter how many times you repeat it.
I have a hard time believing that this algorithm will be able to resist simple image manipulations while still being sensitive enough to avoid false positives.
I have similar qualms with this, because this is increasing the number of photos scanned by 2-3 orders of magnitude, and the number of false positives presumably also increases correspondingly.
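
A back-of-the-envelope version of that scaling, assuming each scanned photo has the same small, independent chance of a false match. The rate below is purely hypothetical (nobody in this thread knows the real one), and `expected_false_positives` is just a name for the multiplication:

```python
def expected_false_positives(photos_scanned: int, false_positive_rate: float) -> float:
    """Expected number of false matches if every photo errs independently at the same rate."""
    return photos_scanned * false_positive_rate

# Hypothetical per-photo false positive rate, chosen only for illustration.
RATE = 1e-9

# Scanning volume growing by two and then three orders of magnitude.
for photos in (10**9, 10**11, 10**12):
    print(photos, expected_false_positives(photos, RATE))
```

Whatever the true rate is, the expected count grows linearly with the number of photos scanned, so 2-3 orders of magnitude more photos means 2-3 orders of magnitude more false hits unless the threshold is tightened.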
What algorithm are they using?
I don't know. If they made that public it would be completely ineffective.
Not a single joke was made about child abuse.

The specific subject of child abuse is irrelevant in my commentary. It was a commentary on the general category of AI, used all over, for many things, and more and more every day, but nice try.

Edit: You edited out what I was referring to as I was replying.

Microsoft has been using PhotoDNA [0] to scan OneDrive content for quite some time. The news is that a device manufacturer is doing it on your device with local data. There's also some MS research on using machine learning on metadata to identify offending material [1].

[0] https://en.wikipedia.org/wiki/PhotoDNA

[1] https://arxiv.org/pdf/2010.02387.pdf

It makes sense for website owners to scan what's being uploaded to their servers, but that is totally different from scanning what's stored locally on people's devices.
> It's fuzzy hash, not "AI", based. Cloudflare uses it too, and last time I checked the web is still functional.

If they're using fuzzy matching with perceptual hashes, then the space that false positives can exist in for each perceptual hash is huge.
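
To put a number on "huge", here is a sketch that counts how many distinct hash values fall within a given Hamming distance of a single target. It assumes a 64-bit hash and treats the match radius as a free parameter; both are guesses, since the actual hash length and tolerance aren't public, and `ball_size` is a made-up helper name:

```python
from math import comb

def ball_size(bits: int, radius: int) -> int:
    """Count of bit strings of length `bits` within `radius` bit flips of a fixed hash."""
    return sum(comb(bits, i) for i in range(radius + 1))

BITS = 64  # assumed hash length

# How fast the set of accepted hash values grows as the tolerance is loosened.
for radius in (0, 1, 2, 5, 10):
    accepted = ball_size(BITS, radius)
    print(radius, accepted, accepted / 2**BITS)
```

The count of accepted values grows combinatorially with the radius, and that is per database entry. How often real photos actually land in those regions depends on how image hashes are distributed, which this sketch says nothing about.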

So this means it is checking if you are sharing known CP images? That does seem to be much less invasive and problematic as there is likely no good reason to be sharing these images.
It’s not just checking files you are sharing, it’s also scanning files that exist on your device. The worry, or slippery slope argument, is that it’s one step away from scanning your device for other types of content, like memes critical of the government or just general wrongthink.
A lot of governments would be very interested in this feature.
There actually are some edge cases even for matching against image blacklists. Google has experience with hitting them because it's used this type of image simhash for years (for shared cloud files at least).

The definition of child porn varies around the world. These systems use the US definition. This is not entirely what you might expect. For example, in the USA the courts have decided that cartoons can be child porn even though no actual children are in the picture. Most of the world does not agree with this, meaning an image can be CP in one place but not another. Is Apple going to enforce the US definitions or the ones where the user actually lives?

In the USA, photos an under-age person takes of themselves can also be considered CP.

What counts as a "child" for sexual purposes also varies around the world. Some countries have a lower age of consent than other places. In some parts of the world the age of consent and the age at which a child stops being a child for CP purposes are different, meaning that a teenager can have sex legally but if they take a photo of themselves doing it, they are trafficking in CP.

Finally, what is actually on these image blacklists? Hardly anyone actually knows because of the third rail nature of CP. Tech firms are often delivered image hashes, not even the images themselves, by third party 'charities' of various kinds and tech workers are - for obvious reasons - not normally given access to the actual pixels. Additionally, appeals from users are invariably ignored because people say "legal issues, it's complicated" and so everyone clams up. If FPs occur there is no way to resolve it and the people who see your appeal, if there even is one, won't be willing to actually look at the image to find out what it was.

It should be obvious how much potential for abuse this hands the people who actually manage these CP databases. Literally any image can be made verboten immediately, without any recourse, and basically nobody will ever find out including the people who shut down the affected users.

Yes, that's exactly it. It uses a database, compiled by NGOs and specialized firms, consisting of file hashes that match child porn. These lists are handled by humans.

Fuzzy means that it takes compression and the like into account. With an ordinary file hash, even if just one pixel out of 20 thousand is different, the hash is different too. A fuzzy hash still recognizes it as the same image, so using an algorithm to alter the color etc. won't work.

> These lists are handled by humans.

That's also true for the no-fly list and the Terrorist Screening Database,[1] yet those are full of false positives. And unlike those lists, CSAM databases cannot be independently verified. To do so would require having the original images, which is illegal.

1. https://en.wikipedia.org/wiki/Terrorist_Screening_Database

The no-fly list and the terrorist screening database aren't used in a court of law. The Confrontation Clause of the Sixth Amendment guarantees you access to all the evidence presented against you. You also don't need the original images to defend yourself, though apparently CP can be presented to a (traumatized) jury [0].

So if you're charged on the basis of a fuzzy hash matching, you'd subpoena Apple for the photo in your backup that matched, present it to the court (since it doesn't actually matter if it's CP or not to be admissible), and you win the case.

0. https://www.johntfloyd.com/the-difficulty-with-criminal-evid...

> So this means it is checking if you are sharing known CP images?

No. NeuralMatch was “trained” using 200,000 CP images. “Neural” is likely a reference to the perceptual matching that it uses. It is not a bit-for-bit match.

Perceptual matching is a technique used for categorizing images based on characteristics and content.

The algorithm will scan your library containing new information and compare it to what it understands as CP.

There is no reference to fuzzy anything in any of the write-ups. I don’t know how you can proclaim that this is a fuzzy hash. Where is your source?