by jozefjarosciak 2080 days ago
Folks, I am the guy behind this project. A friend of mine mentioned he saw the site mentioned on Hacker News, so I came to check it out. If you have any questions for me, don't hesitate to ask. As time (and two little boys) permits, I'll do my best to answer them.
15 comments

Why are many words censored? To take a completely random example (I just took one with many asterisks) where it makes the post completely illegible: https://www.usenetarchives.com/view.php?id=soc.sexuality.gen...

For an archive this is a big no-no. Respect the source material!

Otherwise, thank you for the time spent doing this.

I am running it through a certain set of filters. From my SEO days I recalled that new websites are often penalized by search engines for certain keywords. Considering this is a new site, and there are 300 million plus posts that I am not able to read and moderate, this is the best way I know of to deal with it. But perhaps you're right and I should get rid of it. I'll think about it. This is a valid comment.
Please do not filter anything. A project of this scope is greater than any SEO issues you may have.
Since you seem intent on being a reference usenet archive I think it's important to preserve the integrity of the original material. Moderating posts 20 or 30 years after the fact seems ill advised. If you modify the content in any way, at least put a prominent notice so that people don't get confused by the website name.

Also, it seems that your parsing process strips headers and that you don't keep the raw messages. However, I remember that on some newsgroups people used to pass secret messages in headers that only those "in the know" would look for, and it would be a shame to lose that. Access to posts in raw format would be nice in this scenario.
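A minimal sketch of what keeping the raw message next to the parsed one could look like, using Python's standard `email` module. The sample post, the `X-Secret` header, and the `parse_post` helper are all made up for illustration; nothing here reflects how the site actually stores posts.

```python
from email import message_from_string

# Hypothetical sample post; addresses and headers are invented.
RAW_POST = """\
From: alice@example.invalid
Newsgroups: comp.arch,comp.misc
Subject: Pipelining
Message-ID: <1234@example.invalid>
X-Secret: hello to those in the know

Getting good FP performance seems to require pipelining.
"""


def parse_post(raw):
    """Parse a post but keep the untouched raw text alongside the result."""
    msg = message_from_string(raw)
    return {
        "raw": raw,                       # original bytes-as-text, never modified
        "headers": list(msg.items()),     # every header, including unusual ones
        "newsgroups": msg["Newsgroups"].split(","),
        "body": msg.get_payload(),
    }


post = parse_post(RAW_POST)
```

Storing `raw` costs extra space, but it means any later parsing mistake can be corrected from the original text.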

Maybe rot13 the words you think you need to censor? That'd be in keeping with the usenet tradition at least from the mid-late 90s when I was reading/posting heavily. And maybe add a simple javascript ROT13 widget so people can easily reveal it? (There was a time in my life when I could read ROT13-ed things pretty accurately in my head.)
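For anyone who has not met ROT13: it shifts each letter 13 places, so applying it twice gives back the original text. A small sketch in Python (not the JavaScript widget suggested above), using the `rot13` codec from the standard library; the `rot13` helper name is just for illustration.

```python
import codecs


def rot13(text):
    """Shift each ASCII letter 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot13")


hidden = rot13("Hello")
revealed = rot13(hidden)  # applying ROT13 again restores the original
```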
Double rot13 just to be sure.
It's 2020 and we're under threat from state-level hackers. We need quadruple rot13!
:-)
I've decided to remove bad word filtering and all other censoring. Let's see how it goes.
One option is to censor by default for SEO, but have some checkbox that sets a cookie that uncensors it.
This would be a really cool solution, kind of the reversal of typical SEO bombing techniques that hide spam pages on compromised sites.
You should definitely get rid of whatever is being used currently. The first group I randomly clicked (alt.alien.visitors) was censoring the word "public" (and "sucks" and "pipe"), multiple times in the same post which, if it happens a lot, especially on innocuous words, is really going to spoil what is an excellent project.

It's not a bad idea to filter content though, and/or have a flag button on threads/posts. 300 million articles from 40 years of an obscure and anarchic corner of the internet are bound to contain posts that are either potentially illegal or which you otherwise don't necessarily want to be publishing.

I've removed the filtering.
You can also remove the filter for users, but use site maps to make those posts not visible to search engines.
Thanks. Much appreciated. This way we get to experience the colorful humans in full.
Are you planning to monetize? If not, then keep SEO out of it.
> For an archive this is a big no-no. Respect the source material!

Though I'm also curious, that's perhaps not the tone I would have used when asking. After all, better a censored archive than no archive.

I'm just speculating, but it may be the policy of usenetarchives.com, in order to accept their upload.

Censoring seems to be done around email addresses, names, and offensive words. Perhaps, this is done to reduce the chances of people later asking for the posts to be taken down entirely.

For example, I believe the takedown by Google of comp.lang.lisp and comp.lang.forth commented elsewhere was done because there was offensive content present. The Google support request that mentioned that reason was taken down, but it's what I remember.

I've decided to remove bad word filtering and all other censoring. Let's see how it goes.
For that example, it's at least fairly obvious why it was censored, but this one really puzzles me:

"Getting good FP performance from a micro seems to require pipelining. Keeping the p<asterisk><asterisk>e(s) full seems to require a certain amount of parallelism and regularity."

https://www.usenetarchives.com/view.php?id=comp.arch&g=14965...

Edit: Ah, the irony. HN markdown causes two consecutive asterisks to disappear.

You can note markup in a code block --- 4 space indent.

    A code block.
    
    Two asterisks follow: **
   
    This normally would be *italicised*.
It's two spaces. Of course four will work as well, it just adds extra indentation.
Thanks.
> For that example, it's at least fairly obvious why it was censored.

Actually, it's not that obvious. It censored "Dirty Sanchez". I'm thinking it thought it was a person's name, and censored it for privacy reasons?

> Keeping the p<asterisk><asterisk>e(s) full seems to

I suppose "pipe" can have an offensive/sexual connotation. Even if it doesn't so much today, perhaps it did back then.

My guess is that they just used a fairly large word list, which contained a bunch of euphemisms like Dirty Sanchez.
A good guess, that's exactly what I did.
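A rough sketch of how a word-list filter like this might behave. The quoted comp.arch example shows "pipe(s)" masked while "pipelining" was left alone, which is consistent with whole-word matching that keeps the first and last letter. That is only a guess from the output: the word list, the function names, and the masking rule below are assumptions, not the site's actual code.

```python
import re

# Hypothetical word list; the real one is said to be "fairly large".
WORD_LIST = {"pipe", "sucks"}


def build_pattern(words):
    """Compile one case-insensitive pattern matching any listed whole word."""
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.compile(rf"\b({alternation})\b", re.IGNORECASE)


def mask_word(match):
    """Keep the first and last letter, replace the rest with asterisks."""
    word = match.group(0)
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor(text, words=WORD_LIST):
    return build_pattern(words).sub(mask_word, text)


sample = censor("Keeping the pipe(s) full seems to require pipelining.")
```

Even with word boundaries, a list this blunt still hits innocent technical uses of a word, which is exactly the complaint in this thread.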
IIRC, 'Dirty Sanchez' is a slang term for a sexual act.
"pipe" means blowjob in french. Maybe the filter dictionary is multilingual?
It can happen. A decade ago, Apple’s automated App Store processes warned me that “Knopf” was a dirty word in German.

(It isn’t: while “knob” can be translated as “Knopf”, the latter doesn’t have the anatomical meanings of the former).

For future historians, the post is about the sex act known as the “dirty sanchez”.
This is not what I expected to learn today on HN
Sigh. Whatever it says about you that the first place you looked was soc.sexuality.general, it is even more saddening that I knew all the words which had been redacted.
Not a perfect filter:

> On Oct 13, 11:36 am, "Colin" <Co...@DirtySanchez.b••t> wrote:

It is also homophobic as the word "lesbian" is one of the words censored.
Some of us appreciate a good mask of asterisks.
I could have sworn I did this 9 years ago. :-) http://kitenet.net/~joey/blog/entry/announcing_olduse.net/

Did you find additional utzoo material beyond what was already on archive.org in tarballs?

THANK you for doing this (I said, having never so much as logged in to a usenet thingie)!

I do have one question. Those tapes... how old were they? Were they contemporary with the postings? (Bonus question: if so, wow, but under what justification? Tapes were expensive, right, and nobody valued archiving at the time, except maybe this guy.) Did the individual you got the tapes from meticulously copy them onto new media, with all the overhead that entails?

Today I make personal backups of things I want to keep, whether local files or web snippets, and burn 'em to a DVD, Blu-ray, or optical disk of some kind with the justification of 'can't ransomware WORM media'.

However, I don't do the internet archiving guru stuff of '3 copies, 2 mediums, 1 routine' (or something like that) so in a sense my backups are cruising for a bruising in the case that optical disks could go bad, hard drives could die (these I do migrate to new media when I get around to it, CDs I just read and burn to a fresh one), house could burn down, etc.

Partially looking to see if I can get justification for my backup slovenliness ;)

To satisfy the "third place" you might look into storing these files in Amazon S3 Glacier, which is about $1/TB/month, so long as you don't read them back. (Retrieval is delayed, batched, and expensive.)

It would take some engineering: I might store each optical disk image as a compressed image file, for example (zstd would be good for the large amount of data), to avoid metadata charges. Fun to think about.

I just want to say thank you to you and everyone involved! I ran a little mini project a while ago to archive my old Usenet posts by using puppeteer to scrape the content from Google groups, knowing full well that they could shut it down by fiat any time.

I've frankly never trusted them as a steward for this and I'm glad to see someone stepping in.

Do you have any plans to team up with the Internet Archive on this?

Once I have everything in DB format, I'll certainly attempt to post it there.
Please get in touch with collections-service@archive.org (Internet Archive Patron Services) when you’re ready. They will assist you with uploading the corpus to the archive (item creation, collection creation and assignment, metadata hygiene, etc). Thank you for your efforts!
Excellent! Btw, are the posts still uncensored in the DB format?
Nice work my friend. As many demo-scene groups were active on newsgroups before going very dark almost 30 years ago I'm curious if you also have a collection that you can contribute to this basic collection https://archive.org/details/scenenotices . There's lots of this stuff going on throughout the decades but you may be in a position to help preserve history / lineage on the entire environment.
I won't have much when it comes to binaries, I am text mostly, but in terms of text I should have things going that far back.
Archive.org destroyed their utzoo mirror.

It’s sadly not good enough.

There's weird gaps where groups have some posts from a given date but not all of the posts (e.g. I know I posted to csm.hypercard on 24 April 1990 and can see it in Google's dejanews cache, but not in this cache which has other posts to the group on that date). Was it just luck of the draw from what was cut to tape?

How do you handle cross posting?

I am still in the process of moving posts to the online DB.
How do you deal with the (European) right to be forgotten?

On one hand it’s nice that we can see all this old data. On the other hand, as far as I remember in the 90s you typically subscribed to a newsgroup and only then got recent posts. So the expectation was a limited lifetime for the post. Of course it would still be stored in thousands of readers.

Obviously the copyright of each post is with the author. You just assume that you have a right to redistribute.

I’m also happy if I get a pointer elsewhere if you have an url.

> How do you deal with the (European) right to be forgotten?

Probably by not being European.

Europe doesn't make laws for the whole world anymore. Colonialism is dead.

What about copyright?
Archiving is considered fair use
I'm getting this:

Fatal error: Uncaught Error: Call to a member function real_escape_string() on null in /var/www/html/usenetarchives.com/3.vars.php:80 Stack trace: #0 /var/www/html/usenetarchives.com/1.header.php(190): include() #1 /var/www/html/usenetarchives.com/view.php(22): include_once('/var/www/html/u...') #2 {main} thrown in /var/www/html/usenetarchives.com/3.vars.php on line 80

Pretty consistently.

On which pages? Please share some examples. I was making some changes there, so let me know.
It's great that your service comes into existence at all, and if you set up an opencollective or patreon account, I'd gladly contribute to the operating costs.

As it exists right now, there are some glaring omissions. I would imagine that showing authors in the search results next to the titles would not be prohibitively difficult.

Proper threading of posts would probably be considerably harder, but would provide immense value IMHO.

Any chance at getting access to all the other raw data? The UTZOO archives have been invaluable, and the only reason they survive today is because they were mirrored.

DMCA took this incredible resource off line. Please don’t let this happen again, as even archive.org found it easier to destroy it all than to fight Marty.

Plus it allows all us little people to build fun things with it, like my altavista based search of utzoo:

altavista.superglobalmegacorp.com

Thanks

Thank you! Is 2003 a hard limit to how far back was captured, or just a current point in the import? I was looking to comp.os.linux.* groups from a bit earlier than that, a lot of the 90s. Thanks again :)
People, once you get to a newsgroup, click on 'all years' to see which years are in the archive, and select which year you want. Don't expect posts from last millennium to be present, at least not yet.
At the time of my comment, 2003 was the oldest year present for the group(s) I mentioned. I searched for a FAQ before posting, and it's not clearly marked "imports are still in progress". Please don't be condescending. OP specifically stated "ask me questions." You are not OP.

Edit: I am now on a desktop, and confirm it's not a mobile glitch; the archives for comp.os.linux are imported back to 2003 only at this time. https://www.usenetarchives.com/threads.php?id=comp.os.linux&...

Will you be making this available on NNTP servers? It might be nice to see it on a text-only NNTP server. Surely one of the big Usenet providers would want to help, as it seems to be in their interest.
Do you know where I can find an archive with the de.* (German) groups?
Do you have any concerns about the so-called 'right' to be forgotten? Are you concerned that anyone who was active back then might lose a job now due to his postings a couple of decades ago?
Agree with throwaway89201. Don't do that. Reproduce the text as you found it.