by Elepsis 3990 days ago
Microsoft made an automated system (PhotoDNA) for detecting known child pornography images available to the public a few years ago and it's probably a good starting point: http://www.microsoft.com/en-us/PhotoDNA/

Hopefully this can help you.

(Disclosure: I work at Microsoft but not on PhotoDNA.)

6 comments

PhotoDNA is the gold standard for this. I tried to get access to this via the NCMEC to use with Neocities, but the process was, frankly, very convoluted. I signed at least 10 forms and still didn't end up getting what I needed.

I'm happy that Microsoft is providing this as a free service. It's going to be a lot less painful for me to use it than to figure out how to run my own (or in this case, figure out how to even get it).

Update: I just tried to get it to work and surprise! It doesn't work.

Somebody please just give me access to the PhotoDNA code, the hashes, and a little funding. I'll make an API anybody can use for this. It's ridiculous how hard it is to do this. It's still easier for people to get spam IP lists than to see if CP is being uploaded to their servers. It doesn't work if it's only available to Facebook and Google; you need to make it available to everybody in an easy, simple way.

Seriously, if you are connected with this at all or want to fund this work please email me, I am more than happy to work on improving this: kyle@neocities.org.

Contact this person. He manages Microsoft's Safety Services division, which includes PhotoDNA.

https://www.linkedin.com/pub/john-scarrow/6/2b8/354

Would you like to describe what did not work? What was your experience? Please blog about this!
On the other hand, if everyone uses the same non-transparent list of magic hashes to ban hosting images then censorship potentially becomes a concern.
If non-CP images start being blocked by this system, some gallery author is going to notice it pretty quickly and report it to the website owner. That kind of censorship would very easily destroy trust in the CP filter, and I doubt people at Microsoft would fail to predict this.
Yahoo, as well as Tumblr, uses an internal service that is synced with PhotoDNA. Content is also filtered by outsourced help. There is a company that employs people in a Southeast Asian country and provides this service to many Internet companies (including Yahoo). (I work at Yahoo and discussed this with the team here.)
What company? Please ask for the name. Thanks!
Do you know of any other software like this that is open source? I posted a thread a few days ago about image-comparison software like Google reverse image search. Similar to OP, I wanted to index a few popular image boards and make sure that no one had tried to post unauthorized photos on them.

Edit: I should clarify that I don't mean child porn, but personal photos, such as private or semi-private ones from Instagram and Facebook, that are then posted to public forums.

There's a system called IQDB used for various 'booru' websites. It's open sourced and available here: http://iqdb.org/code/

Really though it's not too hard to whip something up yourself. I did it for a bunch of those 'booru' sites (roughly 3 million images) like this:

- Find image hashing library (I used https://github.com/JohannesBuchner/imagehash but there's a nice series of articles here http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-I... if you want to implement your own)

- Build a database of image hashes using said library

- Use an algorithm that lets you look up hashes by distance. In the case of Hamming distance (used by many image hashes) you can just throw them in MySQL. You could also use a nearest-neighbour search technique such as k-nearest neighbours or locality-sensitive hashing (you'd want one of these for larger datasets)
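
To make the steps above a bit more concrete, here is a minimal stdlib-only sketch. It is not the code I actually ran: in practice a library such as imagehash would decode and shrink the image for you. Here `dhash_from_grid`, `hamming` and `find_near` are hypothetical helper names, the input is assumed to be an already-resized 8x9 grayscale grid, and the lookup is a brute-force scan (fine for small sets; in MySQL the same distance can be computed with `BIT_COUNT(hash ^ query)`).

```python
from typing import Dict, List, Sequence, Tuple


def dhash_from_grid(grid: Sequence[Sequence[int]]) -> int:
    """Difference hash of an 8-row x 9-column grayscale grid.

    Each of the 64 bits records whether a pixel is brighter than its
    right-hand neighbour, so uniform brightness changes leave it unchanged.
    """
    if len(grid) != 8 or any(len(row) != 9 for row in grid):
        raise ValueError("expected an 8x9 grid of grayscale values")
    bits = 0
    for row in grid:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_near(db: Dict[str, int], query: int,
              max_distance: int = 10) -> List[Tuple[str, int]]:
    """Brute-force lookup: every stored hash within max_distance, nearest first."""
    hits = [(name, hamming(stored, query)) for name, stored in db.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])


if __name__ == "__main__":
    # Synthetic stand-ins for real, resized images (assumption for the demo).
    gradient = [[c * 20 + r for c in range(9)] for r in range(8)]
    mirrored = [list(reversed(row)) for row in gradient]
    db = {"gradient": dhash_from_grid(gradient),
          "mirrored": dhash_from_grid(mirrored)}

    # A slightly edited copy: one pixel blown out to white.
    edited = [list(row) for row in gradient]
    edited[0][0] = 255
    print(find_near(db, dhash_from_grid(edited)))
```

The distance threshold of 10 is an arbitrary starting point; you'd tune it against your own false-positive rate.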

Why does one need to sign in to Azure to help fight CP? Why are the hashes not available via a public API so that any webmaster could just use it right now? Please explain.
Why would it even need an API? Just provide the hashes.

What are they afraid of? That "pedophile hackers" will be able to reverse the hashes and get the images?

The service is more than a hash matching service. It hashes different regions of the image, allowing it to match images that have been altered.
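
To illustrate the general idea of hashing regions separately, here is a toy sketch. To be clear, this is not PhotoDNA's algorithm, whose details aren't given in this thread: it is a plain per-tile average hash, and `tile_hashes`, `matching_regions` and the thresholds are all made up for illustration. The point is only that when one part of an image is altered, the other regions can still match.

```python
from typing import List, Sequence


def tile_hashes(pixels: Sequence[Sequence[int]], tiles: int = 2) -> List[int]:
    """Split a square grayscale image into tiles x tiles regions and hash each.

    Each region gets a simple average hash: one bit per pixel, set when the
    pixel is brighter than that region's mean.
    """
    size = len(pixels)
    if size == 0 or size % tiles or any(len(row) != size for row in pixels):
        raise ValueError("expected a square image divisible into tiles")
    step = size // tiles
    hashes = []
    for ty in range(tiles):
        for tx in range(tiles):
            region = [pixels[y][x]
                      for y in range(ty * step, (ty + 1) * step)
                      for x in range(tx * step, (tx + 1) * step)]
            mean = sum(region) / len(region)
            bits = 0
            for value in region:
                bits = (bits << 1) | (1 if value > mean else 0)
            hashes.append(bits)
    return hashes


def matching_regions(a: List[int], b: List[int], max_bits: int = 2) -> int:
    """Count regions whose hashes differ in at most max_bits bit positions."""
    return sum(1 for x, y in zip(a, b) if bin(x ^ y).count("1") <= max_bits)


if __name__ == "__main__":
    # Synthetic 8x8 "image" (assumption for the demo, not real data).
    original = [[(x * 37 + y * 91) % 256 for x in range(8)] for y in range(8)]

    # Altered copy: the top-left quarter is blacked out.
    altered = [list(row) for row in original]
    for y in range(4):
        for x in range(4):
            altered[y][x] = 0

    same = matching_regions(tile_hashes(original), tile_hashes(altered))
    print(f"{same} of 4 regions still match")
```

A whole-image hash would have to absorb the edit in a single distance score, whereas a per-region comparison lets you say "most of this picture is the same".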

Authenticating access to the service is desirable for many obvious reasons.

>Authenticating access to the service is desirable for many obvious reasons.

Help me out with the obviousness please? Are those reasons more important than deleting child pornography from the web?

Off the top of my head: DoS, providing perverts with confirmation that an image is what they want it to be, giving organized groups intel that images are known to law enforcement, etc.
Perhaps to avoid people figuring out how to evade it?
Just compressing the images into an archive, if not encrypting them, is enough to evade such a filter.

There are a lot more legitimate uses for a public repository of CP hashes along with free software for verifying them locally. Not only entrepreneurs and online community operators who don't want the stuff on display, but also users of poorly moderated online communities who don't want the stuff in their browser caches.

One vendor selling an automatic child porn filter using data from INTERPOL is https://www.netclean.com/. It also uses Microsoft's PhotoDNA technology.
Interesting. I had not known about that service, and it's cool that it's free (though I'm not sure if it's always free; it says free to qualified applicants).

I am curious how one goes about developing a service like this without having to see child porn itself. Is there a database somewhere with known hashes? I assume there would have to be, along with a way to generate hashes yourself for testing, since I can't imagine running automated unit tests against real child porn.

AFAIK the database of hashes is maintained by the National Center for Missing and Exploited Children, who unfortunately do have to deal with the disheartening task of viewing some of that stuff.
I was going to suggest a convolutional NN, but you'd need to go through the gruelling task of creating the training corpus.