A friend of mine told me about how they once had to dig through their code to figure out why their site was classified as adult by some filter. After days of searching, they found this comment at the bottom of a JavaScript file:
Another false positive anecdote! My parents used to have their ISP's adult content filter enabled. One day I couldn't visit DeviantArt, because it said "Mature Content Filter Enabled" somewhere on the page, and "mature content" triggered the ISP's filter.
I don't find this very useful. It's too naïve for a real-world use case.
I didn't look at the implementation, but judging by the "classy party" example, it looks like it simply matches the byte sequence 'a', 's', 's' anywhere in a string.
It would be better if it tokenized the sentence using punctuation and white-space as terminators. That way, it would detect `big-ass sandwich` and `smart-ass person` but not `classy party` or `bass instrument`.
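To make the tokenizing idea concrete, here is a minimal sketch in Python. It is not the project's implementation (which I haven't looked at); `DIRTY_WORDS`, `tokenize`, and `contains_dirty_word` are names made up for illustration, and the one-word list is just the example from this comment.

```python
import re

# Hypothetical word list for illustration only; the real list is not shown here.
DIRTY_WORDS = {"ass"}

# Anything that is not a word character (letter, digit, underscore) or an
# apostrophe ends a token, so whitespace and punctuation such as hyphens split.
TOKEN_SPLIT = re.compile(r"[^\w']+")


def tokenize(text):
    """Lowercase the text and split it into tokens, dropping empty strings."""
    return [token for token in TOKEN_SPLIT.split(text.lower()) if token]


def contains_dirty_word(text):
    """Return True only if a whole token equals a listed word."""
    return any(token in DIRTY_WORDS for token in tokenize(text))


if __name__ == "__main__":
    for sample in ("big-ass sandwich", "smart-ass person",
                   "classy party", "bass instrument"):
        print(sample, contains_dirty_word(sample))
```

Because matching happens on whole tokens, "classy" and "bass" are never compared against "ass" as substrings.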
Furthermore, it would be cool if you created a configuration format for this kind of thing, so one could do something like this (excuse the config format, I realise it's probably shit and problematic):
[smart][big][fat]ass
!sex[ual]+education
which would detect all of the following: smartass, bigass, fatass, and ass itself. The second rule would not filter a `sex(?:ual)?` token followed by an `education` token. You get the idea.
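A rough sketch of how those two rules could be compiled to regular expressions, in Python. The semantics are my own reading of the format and are assumptions: adjacent `[...]` groups become one optional alternation, `+` means "next token", and a leading `!` marks an allow rule. `compile_rule` and `is_flagged` are hypothetical names, and the bare `sex` rule in the usage example is added only so the allow rule has something to override.

```python
import re

# Either a run of adjacent [..] groups, or a run of literal characters.
PART = re.compile(r"((?:\[[^\[\]]+\])+)|([^\[\]]+)")


def compile_rule(rule):
    """Return (is_allow_rule, compiled_regex) for one line of the sketch format."""
    allow = rule.startswith("!")
    body = rule[1:] if allow else rule
    token_patterns = []
    for token in body.split("+"):  # '+' separates consecutive tokens
        pieces = []
        for groups, literal in PART.findall(token):
            if groups:
                # [smart][big][fat] -> (?:smart|big|fat)?
                options = re.findall(r"\[([^\[\]]+)\]", groups)
                pieces.append("(?:" + "|".join(map(re.escape, options)) + ")?")
            else:
                pieces.append(re.escape(literal))
        token_patterns.append("".join(pieces))
    # Tokens are separated by one or more non-word characters.
    pattern = r"\b" + r"\W+".join(token_patterns) + r"\b"
    return allow, re.compile(pattern, re.IGNORECASE)


def is_flagged(text, rules):
    """Blank out spans matched by allow rules, then search for block rules."""
    compiled = [compile_rule(rule) for rule in rules]
    for allow, regex in compiled:
        if allow:
            text = regex.sub(" ", text)
    return any(regex.search(text) for allow, regex in compiled if not allow)


if __name__ == "__main__":
    rules = ["[smart][big][fat]ass", "!sex[ual]+education", "sex"]
    for sample in ("what a smartass", "sexual education class",
                   "sex tape", "classy party"):
        print(sample, is_flagged(sample, rules))
```

Applying allow rules first keeps the logic simple, at the cost of not supporting a literal `+` or nested brackets in a rule.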
These are just some heat-of-the-moment ideas, because I think this is exciting and could be useful. :-)
Thanks. This quick idea worked for my cases, because there were few potential false positives. But your idea around using a regex style matcher should be good.
With little effort you can run your dirty words through Google Translate into Spanish (copy and paste all the words) and obtain a filter for Spanish; add synonyms for stronger filtering.
Perhaps "gay" is not a dirty word? (It is included in your dirty words, but gay people would probably think otherwise.)
I'm gay, but I don't consider it offensive that the word is in there.
A lot of people use the term "gay" in conversation as a synonym for "that sucks"; a friend of mine does it all the time. I don't think they mean anything by it.
To differentiate between "I am gay," and "Oh that's gay. I'm sorry that happened," you'd need an NLP system with a politeness preference.
Sorry, no offense intended, if anyone took it. In my use case, words such as 'gay' and 'lesbian' were, in almost all cases, used in explicit documents.
This is a very naive implementation to quickly get a handle on the number of porny documents. I intend to do some more work around clustering of porny words. I think understanding sentiment would be hard and would involve a lot of labeled data, but that is a potentially very useful project.
I wanted something easy to use to quickly get an idea of how much explicit content we could be dealing with. The main challenge was dealing with a multi-lingual database. I didn't even find a naive classifier.
I don't have the time/RoI to improve this, but potential ideas are to use labeled data to cluster porny words and get a probabilistic metric of the porni-ness of a sentence.
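For what it's worth, a probabilistic score could start as small as the Python sketch below. It is a naive-Bayes-style per-word log-odds score with add-one smoothing and equal class priors, not clustering, and it is only an illustration: `train` and `porniness` are hypothetical names, and the four labeled sentences are toy data invented for the example.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: iterable of (text, is_explicit). Returns per-word log-odds."""
    counts = {True: Counter(), False: Counter()}
    for text, is_explicit in labeled_docs:
        counts[bool(is_explicit)].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    log_odds = {}
    for word in vocab:
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_explicit = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_clean = (counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds[word] = math.log(p_explicit / p_clean)
    return log_odds


def porniness(sentence, log_odds):
    """Score in (0, 1); 0.5 means no evidence either way (equal priors assumed)."""
    total = sum(log_odds.get(word, 0.0) for word in sentence.lower().split())
    return 1.0 / (1.0 + math.exp(-total))


if __name__ == "__main__":
    # Toy labeled data, made up purely for illustration.
    docs = [
        ("hot explicit video", True),
        ("explicit adult video", True),
        ("cooking video recipe", False),
        ("garden recipe tips", False),
    ]
    weights = train(docs)
    for sample in ("explicit adult", "garden recipe", "something else"):
        print(sample, round(porniness(sample, weights), 3))
```

Words never seen in training contribute nothing, so a sentence made only of unknown words scores exactly 0.5. A multi-lingual database would need labeled data per language.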
// Slut.
Which is Danish for "the end".