I am running it through a certain set of filters. From my SEO days I recall that search engines often penalize new websites for certain keywords. Considering this is a new site, and there are 300 million plus posts that I am not able to read and moderate, this is the best way I know of to deal with it. But perhaps you're right and I should get rid of it. I'll think about it. This is a valid comment.
Since you seem intent on being a reference usenet archive, I think it's important to preserve the integrity of the original material. Moderating posts 20 or 30 years after the fact seems ill-advised. If you modify the content in any way, at least put up a prominent notice so that people don't get confused by the website name.
Also, it seems that your parsing process strips headers and that you don't keep the raw messages. I remember that on some newsgroups people used to pass secret messages in headers that only those "in the know" would look for, and it would be a shame to lose that. Access to posts in raw format would be nice in this scenario.
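To make the raw-format suggestion concrete, here is a minimal Python sketch of storing the untouched bytes next to the parsed view, so nothing in the headers is lost. It is only an illustration: the sample article, the `X-Secret` header, and the `archive_record` function are made up, and I have no idea how the site's actual pipeline is written.

```python
from email import message_from_bytes

# Made-up sample article. X-Secret is a hypothetical header standing in
# for the kind of "in the know" header mentioned above.
RAW = (
    b"From: someone@example.invalid\r\n"
    b"Newsgroups: alt.test\r\n"
    b"Subject: hello\r\n"
    b"X-Secret: only readers who view headers see this\r\n"
    b"\r\n"
    b"Body text.\r\n"
)

def archive_record(raw: bytes) -> dict:
    """Parse an article but keep the original bytes alongside the parsed view."""
    msg = message_from_bytes(raw)
    return {
        "raw": raw,                 # untouched original, servable as "view raw"
        "headers": msg.items(),     # every header as (name, value), including X- ones
        "body": msg.get_payload(),  # body text for the normal rendered view
    }

record = archive_record(RAW)
for name, value in record["headers"]:
    print(f"{name}: {value}")
```

The rendered page can still show a cleaned-up view; the point is only that the `raw` field is never modified, so a "view raw" link stays possible.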
Maybe rot13 the words you think you need to censor? That'd be in keeping with the usenet tradition at least from the mid-late 90s when I was reading/posting heavily. And maybe add a simple javascript ROT13 widget so people can easily reveal it? (There was a time in my life when I could read ROT13-ed things pretty accurately in my head.)
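A rough sketch of the ROT13 idea, in Python rather than the JavaScript a page widget would need. The word list here is a placeholder (the site's real filter list isn't known), and `rot13_filtered` and `rot13` are just illustrative names. It only touches whole words, and since ROT13 is its own inverse, the same function reveals what it hid.

```python
import codecs
import re

# Placeholder word list; the site's real filter list is not known.
FILTERED_WORDS = {"sucks"}

def rot13(text: str) -> str:
    """ROT13 is its own inverse, so the same call hides and reveals."""
    return codecs.encode(text, "rot_13")

def rot13_filtered(text: str, words=FILTERED_WORDS) -> str:
    """Replace each listed word (whole word, case-insensitive) with its ROT13 form."""
    if not words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in sorted(words)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: rot13(m.group(0)), text)

hidden = rot13_filtered("This newsreader sucks, said the public.")
print(hidden)
```

Unlike replacing letters with asterisks, nothing is destroyed: anyone who wants the original word can run it through ROT13 again.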
You should definitely get rid of whatever is being used currently. The first group I randomly clicked (alt.alien.visitors) was censoring the word "public" (and "sucks" and "pipe"), multiple times in the same post. If that happens a lot, especially on innocuous words, it is really going to spoil what is an excellent project.
It's not a bad idea to filter content though, and/or have a flag button on threads/posts. 300 million articles from 40 years of an obscure and anarchic corner of the internet are bound to contain posts that are either potentially illegal or which you otherwise don't necessarily want to be publishing.
> For an archive this is a big no-no. Respect the source material!
Though I'm also curious, that's perhaps not the tone I would have used when asking. After all, better a censored archive than no archive.
I'm just speculating, but it may be the policy of usenetarchives.com, in order to accept their upload.
Censoring seems to be done around email addresses, names, and offensive words. Perhaps this is done to reduce the chances of people later asking for the posts to be taken down entirely.
For example, I believe Google's takedown of comp.lang.lisp and comp.lang.forth, mentioned elsewhere, happened because offensive content was present. The Google support request that gave that reason has since been taken down, but that's what I remember.
For that example, it's at least fairly obvious why it was censored, but this one really puzzles me:
"Getting good FP performance from a micro seems to require
pipelining. Keeping the p<asterisk><asterisk>e(s) full seems to require a certain amount of parallelism and regularity."
Sigh. Whatever it says about you that the first place you looked was soc.sexuality.general, it is even more saddening that I knew all the words which had been redacted.