by faustomorales 2420 days ago
Hi HN! I'm a data scientist at Thorn, a non-profit dedicated to defending children from sexual abuse. We're excited to open source some elements of our perceptual hashing tooling as a Python package. We've tried to make it very flexible both for ourselves and hopefully also for others. Our aim is to provide tools that (1) help more people eliminate child sexual abuse material from the internet and (2) assist with common tasks where perceptual hashing can be helpful (e.g., media deduplication). We hope you'll take a look, get some use out of the package (check out the example use cases), and even contribute feedback and/or code to make it better.

For more information on the issue, I urge you to check out our CEO's TED talk here: https://www.thorn.org/blog/time-is-now-eliminate-csam/

Documentation for the package here: https://perception.thorn.engineering/

7 comments

I just wanted to take the opportunity to thank you for what you and your colleagues do. I honestly don't know if I could work in a field like that without just being overwhelmed by it, but I'm really glad that others can and do.
If I may, I'm curious about your thoughts on a few things, in the context of use case #1 (abuse), not #2 (deduplication):

* What are the challenges surrounding verification that your system functions properly, given that the test material is illicit?

* Can you speak to the reliability of the system in a sensitivity/specificity kind of way? In other words, what are the false positive and false negative rates?

* Are you aware of any large organizations leveraging your solution?

* Do you feel that the availability of these tools obligates service providers to use them, either morally or legally?

Thanks for asking these important questions!

* What are the challenges surrounding verification that your system functions properly, given that the test material is illicit?

You're right that storing child sexual abuse material (CSAM) is illegal, unless you are the National Center for Missing and Exploited Children (NCMEC) or law enforcement. What is legal is to maintain a hash of known CSAM. NCMEC, law enforcement, and large tech companies maintain their own data sets of known CSAM hashes and, where appropriate, share them. The Technology Coalition [1] has more information on this. All that said, we can and do simulate the system to verify that it works properly using bench testing with non-illegal content [2].

* Can you speak to the reliability of the system in a sensitivity/specificity kind of way? In other words, what are the false positive and false negative rates?

The false positive rate in practice is very low. We set our thresholds based on bench tests with an expected false positive rate of less than 1/1000 (the thresholds vary based on which hash function was used). Different hash functions are more resilient to some transformations than others (e.g., cropping, watermarks, etc.).

For the false negative rate, it depends entirely on the kind of modification made to the image. For many common operations, it is close to zero.
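
To illustrate how a threshold can be tied to a target false positive rate, here is a minimal sketch. It is not Thorn's bench-testing code: the function names (`pick_threshold`, `rate_below`) and the distance lists are made up for illustration, and it assumes a match is declared when the distance between two hashes falls strictly below the threshold.

```python
def pick_threshold(nonmatch_distances, max_fpr=0.001):
    """Pick a threshold so that at most max_fpr of the non-matching pairs
    have a distance strictly below it (and would be flagged by mistake)."""
    ordered = sorted(nonmatch_distances)
    allowed = int(max_fpr * len(ordered))  # false positives we tolerate
    return ordered[allowed]


def rate_below(distances, threshold):
    """Fraction of pairs whose distance is strictly below the threshold."""
    return sum(d < threshold for d in distances) / len(distances)


# Synthetic stand-ins for bench-test output (hypothetical numbers).
# Distances between unrelated image pairs:
nonmatch = [0.20 + 0.0005 * i for i in range(1000)]
# Distances between originals and their transformed copies:
match = [0.01 * i for i in range(30)]

threshold = pick_threshold(nonmatch, max_fpr=0.001)
fpr = rate_below(nonmatch, threshold)   # unrelated pairs wrongly flagged
fnr = 1 - rate_below(match, threshold)  # transformed copies that are missed

print(f"threshold={threshold:.4f} fpr={fpr:.4f} fnr={fnr:.2f}")
```

In this toy setup the false negative rate is driven entirely by how far the transformation pushes the distance, which is why it varies so much from one kind of modification to another.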

* Are you aware of any large organizations leveraging your solution?

Thorn builds technology to defend children from sexual abuse, and one of the products we build for this purpose is Safer [3]. Perception provides an easy way to get started using the Safer matching service. Safer provides a more robust and complete solution including handling a queue of content and reporting tools. Some organizations using Safer include Imgur, Flickr, and Slack.

But this technology (perceptual hashing) is used by many companies who don't use our tools. Our goal is just to make it easier for more people to get started.

* Do you feel that the availability of these tools obligates service providers to use them, either morally or legally?

Not being a lawyer or a public policy expert, what I can say is that the law, as I understand it, requires companies to report CSAM once they are aware of it. Working in this field I’ve learned two things pertinent to this question: (1) Most people don’t know how pervasive of an issue this is, and (2) There aren’t a lot of easy ways to start protecting your platform from this abuse. No one wants the cool new products and platforms they make to be used to abuse children. Privacy is important too, which is why solutions that preserve privacy and avoid leaking private information to third parties are critical, and perceptual hashing allows us to do both.

[1] https://www.technologycoalition.org/

[2] https://perception.thorn.engineering/en/latest/examples/benc...

[3] https://getsafer.io

EDIT: Line breaks

A false positive rate of 1/1000 is hard to assess without actual prevalence stats, but with a decent-sized userbase it seems likely you're still going to get a significant number of false positives. Is it intended that users of your system would have employees manually vet all positives (with legal and mental health concerns) or just submit them without review? I'm coming from having built tools to support a large manual sweep in the 2000s and watching the toll it took on my coworkers.
Great question. Organizations decide how to handle reviews internally, so the answer to “review all” versus “automatically submit” is a perhaps unsatisfying but honest one: it depends. We provide a guide [1] to help organizations formulate their own policies. And we're currently working on a content moderation tool that focuses on helping organizations operationally handle problematic content and considers the wellness and resiliency of reviewers.

[1] https://www.thorn.org/sound-practices-guide-stopping-child-a...
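
The base-rate concern raised above can be made concrete with some rough arithmetic. The numbers below (ten million uploads, a prevalence of 1 in 100,000) are hypothetical, the sketch assumes the 1/1000 figure applies per scanned item, and the function name `expected_review_load` is only for illustration.

```python
def expected_review_load(items_scanned, prevalence, fpr, fnr=0.0):
    """Estimate how many flags are false and what share of flags are real.

    prevalence: fraction of scanned items that truly match known material.
    fpr: chance a non-matching item is flagged anyway.
    fnr: chance a truly matching item is missed.
    """
    true_items = items_scanned * prevalence
    clean_items = items_scanned - true_items
    true_pos = true_items * (1 - fnr)
    false_pos = clean_items * fpr
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    return false_pos, precision


# Hypothetical platform: 10 million uploads, 1 in 100,000 truly matching.
false_pos, precision = expected_review_load(
    items_scanned=10_000_000, prevalence=1e-5, fpr=0.001
)
print(f"expected false positives: {false_pos:.0f}")
print(f"share of flags that are real matches: {precision:.2%}")
```

Under these made-up assumptions most flags would be false, which is why the review policy matters as much as the threshold. A lower prevalence makes the ratio worse, and a higher one makes it better.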

Oh good, I'm sure most organizations can use something like that guide as well as the tools. There's a lot of legitimate worry about both the wellness side & the legal exposure issues, but beyond the common wisdom to be very careful (somehow), I think a lot of people lack clarity as to what exactly that means. Is there a particular reason access to the guide requires handing over contact information?
>I'm coming from having built tools to support a large manual sweep in the 2000s

Any chance you're one of the devs behind C4All?

No, thank god. I was hired as technical lead for the team put together when my new employers inherited a medium-sized low profile social network primarily popular in SA and SEA from their parent company. We got a report on something in the supposedly mostly unused photo sharing feature and discovered there were no moderation tools at all, so the first thing I had to do was build something to even verify the reports. At that point it turned out to be necessary to do a sweep and legal thought it should be done by hand, so my coworkers spent days going through it. It would have been grueling for them even if they hadn't run into the awful material.

(I escaped having to help because my sole minion was the kind of guy who decompresses a several gb bzipped log file as root in _/root_ and wanders away while it's running.)

It seems like an image classifier would work in this field. Are you aware of ML based image recognition and its utility here?

I'm imagining training data would be a hurdle, but surely you could give instructions or suggestions on how to train a model to people already authorized to work with the images.

Wouldn't a human have to go through thousands (or more) of illicit images and classify them in order to train the AI?
Given the kind of burn rate those people have at government agencies I would guess that they have some form of partnership with aforementioned agencies. I glanced through their FAQ and site but didn't see anything specifying that however.

Other than that I have no idea how you would even be able to have the images to classify in the first place without running into problems.

To build a classifier, yes, you are correct. But this isn’t a classifier to identify new content that has never been seen. This uses perceptual hashes to help organizations detect if known CSAM is being shared on their platform.
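
For anyone unfamiliar with the distinction, here is a minimal sketch of matching against known hashes. It uses a simple average hash as a stand-in, not the pHash or PDQ implementations in the package, and it runs on small synthetic grayscale grids. The names `average_hash`, `hamming`, and `is_known`, along with the distance cutoff of 4, are assumptions for illustration.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit hash of a grayscale image (2D list of numbers).

    Assumes the image height and width are multiples of `size`.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // size, width // size
    small = []
    # Downscale to size x size by averaging each block of pixels.
    for row in range(size):
        for col in range(size):
            block = [
                pixels[y][x]
                for y in range(row * block_h, (row + 1) * block_h)
                for x in range(col * block_w, (col + 1) * block_w)
            ]
            small.append(sum(block) / len(block))
    mean = sum(small) / len(small)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in small:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_known(candidate_hash, known_hashes, max_distance=4):
    """True if the candidate is within max_distance bits of any known hash."""
    return any(hamming(candidate_hash, k) <= max_distance for k in known_hashes)


# Synthetic 16x16 test images: a horizontal gradient, a brightened copy
# of it, and an unrelated vertical gradient.
original = [[16 * x for x in range(16)] for y in range(16)]
brightened = [[value + 10 for value in row] for row in original]
unrelated = [[16 * y for x in range(16)] for y in range(16)]

# Only hashes are stored in the known set, never the images themselves.
known_hashes = {average_hash(original)}

print(is_known(average_hash(brightened), known_hashes))
print(is_known(average_hash(unrelated), known_hashes))
```

The point of the sketch is that only the list of hashes needs to be shared and stored, and that a lightly modified copy still lands close to the original's hash while an unrelated image does not.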
For an organization that uses this, wouldn't they need to have access to a source of constantly updated known CSAM? How is that going to work?
Was there discussion about the downsides of open sourcing the implementation? With an open source implementation, doesn’t it become easier to test transformations that change the hash without changing the image in ways a human can notice?
Great question -- all of the hash functions we included in the package have been public for years (except for PDQ, which was open-sourced this year). So this package doesn't reveal anything new with respect to the algorithms themselves. What we add is an easy path to using at least one of them for the CSAM hashing / matching use case. Non-public perceptual hash functions for this use case exist and, naturally, are not available in the package.
> help more people eliminate child sexual abuse material from the internet

Is that actually a good thing?

Less material seems like it would mean more motivation to produce more, which is the very thing we want to avoid.

Can you comment on how this compares to PhotoDNA? I looked at the readme and was surprised to not see mention of it.
Great question! PhotoDNA is the most well-known and supported hash function for this use case. And we do support hashing and matching with PhotoDNA in the Safer product. However, the PhotoDNA hash function is non-public so we cannot include it in an open source package. We support pHash as an open source alternative so that companies without PhotoDNA licenses can get started with hashing.
It’s been a long time since I had to implement PhotoDNA (I helped write a PHP native version of the PhotoDNA hash function at Tumblr). Can you indicate whether pHash creates hashes compatible with PhotoDNA’s output?
Cool (re: writing a PHP version)! Generally speaking, hashes from different hash algorithms cannot be used with each other.

By the way, would be glad to connect and chat, especially if you have any thoughts on pain points associated with perceptual hashing (email in profile).

Does NCMEC support collection and distribution of hashes for alternative functions these days? Six years ago they only supported PhotoDNA hashes (hence why we ported Microsoft’s version of the hash function).

My email is in my profile too, feel free to reach out if I forget.

> hashes from different hash algorithms cannot be used with each other

That’s what I thought, but your mention of pHash suggested it might be an open source drop-in replacement for PhotoDNA, so I wanted to clarify instead of assuming.

Great work faustomorales. Thank you and your colleagues so much for this. Greetings from your old buddies ;)
Thanks, old friend! Hope you decide to give the package a peek and maybe lend us some of your Python chops. :)
Thank you for what you and your team do. Is there any way I can help contribute to your organization?
Thank you so much for asking! Naturally, part of why we wanted to share this with the broader community was to make it so interested people can jump in and help out in the open. And I would be remiss if I didn’t mention the fact that we’re hiring! [1]

We can always use help raising awareness. Advocating for more survivor resources is a great place to start. Help spread the word through your social networks by connecting with us on Facebook and Twitter. Learn about even more ways to get involved and subscribe to our newsletter for general updates [2].

[1] https://thorn.org/careers

[2] https://thorn.org/join-us