I wanted something easy to use to quickly get an idea of how much explicit content we could be dealing with. The main challenge was dealing with a multilingual database. I couldn't find even a naive classifier.
I don't have the time/RoI to improve this, but one potential idea is to use labeled data to cluster porny words and get a probabilistic metric of the porni-ness of a sentence.
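A rough sketch of what that probabilistic metric could look like, in case someone wants to pick it up. This is not what the project currently does. It assumes a hypothetical list of `(sentence, is_explicit)` pairs as the labeled data, and it swaps the clustering step for something simpler: a smoothed per-word log-odds score combined naive-Bayes style. The function names `tokenize`, `train`, and `explicit_probability` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # In Python 3, \w matches Unicode word characters, so this is not tied
    # to one language. Scripts without spaces between words (e.g. Chinese)
    # would need a real tokenizer; that is out of scope for this sketch.
    return re.findall(r"\w+", text.lower())


def train(labeled):
    """Learn per-word log-odds from (sentence, is_explicit) pairs (assumed input)."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_explicit in labeled:
        label = bool(is_explicit)
        counts[label].update(tokenize(text))
        docs[label] += 1

    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(counts[label].values()) for label in (True, False)}

    # Laplace (add-one) smoothing so a word seen in only one class
    # does not produce log(0).
    log_odds = {}
    for word in vocab:
        p_explicit = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_clean = (counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds[word] = math.log(p_explicit / p_clean)

    # Smoothed log prior odds of a sentence being explicit.
    prior = math.log((docs[True] + 1) / (docs[False] + 1))
    return log_odds, prior


def explicit_probability(text, log_odds, prior):
    """Return a value in (0, 1); words never seen in training contribute nothing."""
    score = prior + sum(log_odds.get(word, 0.0) for word in tokenize(text))
    # Clamp before the sigmoid so math.exp cannot overflow on long inputs.
    score = max(-50.0, min(50.0, score))
    return 1.0 / (1.0 + math.exp(-score))
```

With balanced training data, a sentence made only of unseen words scores 0.5, which seems like a reasonable "don't know" default. The scores are only as good as the labeled data, and treating words as independent is a known simplification.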