Hacker News
by throwaway89201 2080 days ago
Why are many words censored? To take a completely random example (I just took one with many asterisks) where it makes the post completely illegible: https://www.usenetarchives.com/view.php?id=soc.sexuality.gen...

For an archive this is a big no-no. Respect the source material!

Otherwise, thank you for the time spent doing this.

8 comments

I am running it through a certain set of filters. From my SEO days I recall that search engines often penalize new websites based on certain keywords. Considering this is a new site with 300 million plus posts that I am not able to read and moderate, this is the best way I know of to deal with it. But perhaps you're right and I should get rid of it. I'll think about it. This is a valid comment.
Please do not filter anything. A project of this scope is greater than any SEO issues you may have.
Since you seem intent on being a reference Usenet archive, I think it's important to preserve the integrity of the original material. Moderating posts 20 or 30 years after the fact seems ill-advised. If you modify the content in any way, at least put up a prominent notice so that people don't get confused by the website name.

Also, it seems that your parsing process strips headers and that you don't keep the raw messages. However, I remember that on some newsgroups people used to pass secret messages in headers that only those "in the know" would look for; it would be a shame to lose that. Access to posts in raw format would be nice in this scenario.
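As a rough illustration of keeping headers rather than stripping them, here is a minimal Python sketch using the standard library `email` parser. The sample article, the `X-Secret` header, and the `split_article` helper are all made up for the example; nothing here reflects how the archive actually parses posts.

```python
from email import message_from_string

# Hypothetical raw article; the X-Secret header is invented for illustration.
RAW_ARTICLE = (
    "From: someone@example.invalid\n"
    "Newsgroups: alt.test\n"
    "Subject: hello\n"
    "X-Secret: only those in the know look here\n"
    "\n"
    "Body text of the post.\n"
)

def split_article(raw):
    """Return (headers, body) without discarding any header."""
    msg = message_from_string(raw)
    # items() yields every header as a (name, value) pair, in original order.
    return msg.items(), msg.get_payload()

headers, body = split_article(RAW_ARTICLE)
for name, value in headers:
    print(f"{name}: {value}")
```

Storing `RAW_ARTICLE` itself next to the parsed form would also let a "view raw" link serve the untouched message.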

Maybe rot13 the words you think you need to censor? That'd be in keeping with the Usenet tradition, at least from the mid-late 90s when I was reading/posting heavily. And maybe add a simple JavaScript ROT13 widget so people can easily reveal it? (There was a time in my life when I could read ROT13-ed things pretty accurately in my head.)
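A minimal sketch of the server-side half of that idea, in Python rather than JavaScript. The `FLAGGED` word list and the `obscure` helper are hypothetical placeholders; only the `rot_13` codec is a real standard-library feature.

```python
import codecs
import re

# Hypothetical word list, for illustration only.
FLAGGED = {"darn", "heck"}

def rot13(text):
    """Rotate each ASCII letter by 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")

def obscure(text):
    """ROT13 only whole words found in FLAGGED; leave everything else alone."""
    def swap(match):
        word = match.group(0)
        return rot13(word) if word.lower() in FLAGGED else word
    return re.sub(r"[A-Za-z]+", swap, text)

print(obscure("Well, darn it"))
```

Because ROT13 is its own inverse, a reveal widget only has to apply the same rotation again to the obscured words; nothing is lost from the source text.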
Double rot13 just to be sure.
It's 2020 and we're under threat from state-level hackers. We need quadruple rot13!
:-)
I've decided to remove bad word filtering and all other censoring. Let's see how it goes.
Thank you :)
Epic!
One option is to censor by default for SEO, but have some checkbox that sets a cookie that uncensors it.
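A minimal Python sketch of that opt-in, assuming a cookie named `uncensored` with the value `1` (both invented here). Crawlers generally do not send cookies, so under this assumption they would get the censored rendering.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "uncensored"  # hypothetical cookie name

def wants_uncensored(cookie_header):
    """True if the request's Cookie header carries uncensored=1."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(COOKIE_NAME)
    return morsel is not None and morsel.value == "1"

def render(original, censored, cookie_header):
    """Pick which version of a post to show for this request."""
    return original if wants_uncensored(cookie_header) else censored

print(render("full text", "f**l text", "theme=dark; uncensored=1"))
```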
This would be a really cool solution, kind of the reversal of typical SEO bombing techniques that hide spam pages on compromised sites.
You should definitely get rid of whatever is being used currently. The first group I randomly clicked (alt.alien.visitors) was censoring the word "public" (and "sucks" and "pipe"), multiple times in the same post. If that happens a lot, especially on innocuous words, it is really going to spoil what is an excellent project.

It's not a bad idea to filter content though, and/or have a flag button on threads/posts. 300 million articles from 40 years of an obscure and anarchic corner of the internet are bound to contain posts that are either potentially illegal or which you otherwise don't necessarily want to be publishing.

I've removed the filtering.
You can also remove the filter for users, but use site maps to make those posts not visible to search engines.
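A rough Python sketch of a sitemap that leaves flagged posts out. The `build_sitemap` helper, the `(url, flagged)` input shape, and the example URLs are assumptions for illustration. Note that omitting a URL from a sitemap only stops advertising it; a crawler can still reach the page through links, so a `noindex` robots meta tag or `X-Robots-Tag` response header is the more usual way to keep a page out of an index.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(posts):
    """posts: iterable of (url, flagged) pairs. Flagged posts are left out."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, flagged in posts:
        if flagged:
            continue  # do not advertise flagged posts to crawlers
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical URLs, for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.invalid/post/1", False),
    ("https://example.invalid/post/2", True),
])
print(sitemap_xml)
```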
Thanks. Much appreciated. This way we get to experience the colorful humans in full.
Are you planning to monetize? If not, then keep SEO out of it.
> For an archive this is a big no-no. Respect the source material!

Though I'm also curious, that's perhaps not the tone I would have used when asking. After all, better a censored archive than no archive.

I'm just speculating, but it may be the policy of usenetarchives.com, in order to accept their upload.

Censoring seems to be done around email addresses, names, and offensive words. Perhaps this is done to reduce the chances of people later asking for the posts to be taken down entirely.

For example, I believe the takedown by Google of comp.lang.lisp and comp.lang.forth mentioned elsewhere was done because there was offensive content present. The Google support request that mentioned that reason was taken down, but it's what I remember.

I've decided to remove bad word filtering and all other censoring. Let's see how it goes.
For that example, it's at least fairly obvious why it was censored, but this one really puzzles me:

"Getting good FP performance from a micro seems to require pipelining. Keeping the p<asterisk><asterisk>e(s) full seems to require a certain amount of parallelism and regularity."

https://www.usenetarchives.com/view.php?id=comp.arch&g=14965...

Edit: Ah, the irony. HN markdown causes two consecutive asterisks to disappear.

You can note markup in a code block --- 4 space indent.

    A code block.
    
    Two asterisks follow: **
   
    This normally would be *italicised*.
It's two spaces. Of course four will work as well, it just adds extra indentation.
Thanks.
> For that example, it's at least fairly obvious why it was censored.

Actually, it's not that obvious. It censored "Dirty Sanchez". I'm thinking it thought it was a person's name, and censored it for privacy reasons?

> Keeping the p<asterisk><asterisk>e(s) full seems to

I suppose "pipe" can have an offensive/sexual connotation. Even if it doesn't so much today, perhaps it did back then.

My guess is that they just used a fairly large word list, which contained a bunch of euphemisms like Dirty Sanchez.
A good guess, that's exactly what I did.
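To see how a broad word list produces results like the comp.arch quote, here is a minimal Python sketch of a whole-word filter that stars out the middle letters. The `WORD_LIST` contents and the masking rule are guesses made for illustration, not the site's actual filter.

```python
import re

# Hypothetical word list; "pipe" stands in for an innocuous word on a broad list.
WORD_LIST = {"pipe", "sucks"}

def mask(word):
    """Keep the first and last letter and star out the middle: pipe -> p**e."""
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]

def censor(text):
    """Mask every whole word that appears in WORD_LIST, ignoring case."""
    def swap(match):
        word = match.group(0)
        return mask(word) if word.lower() in WORD_LIST else word
    return re.sub(r"[A-Za-z]+", swap, text)

print(censor("Keeping the pipe(s) full seems to require pipelining."))
```

The filter has no notion of context, so a technical use of a listed word is masked just like any other, while "pipelining" survives only because it is a different whole word.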
IIRC, 'Dirty Sanchez' is a slang term for a sexual act.
"pipe" means blowjob in french. Maybe the filter dictionary is multilingual?
It can happen. A decade ago, Apple’s automated App Store processes warned me that “Knopf” was a dirty word in German.

(It isn’t: while “knob” can be translated as “Knopf”, the latter doesn’t have the anatomical meanings of the former).

For future historians, the post is about the sex act known as the “dirty sanchez”.
This is not what I expected to learn today on HN
Sigh. Whatever it says about you that the first place you looked was soc.sexuality.general, it is even more saddening that I knew all the words which had been redacted.
Not a perfect filter:

> On Oct 13, 11:36 am, "Colin" <Co...@DirtySanchez.b••t> wrote:

It is also homophobic as the word "lesbian" is one of the words censored.
Some of us appreciate a good mask of asterisks.