| HN Mirror


	by jozefjarosciak 2080 days ago
	Folks, I am the guy behind this project. A friend of mine told me he saw the site mentioned on Hacker News, so I came to check it out. If you have any questions for me, don't hesitate to ask; as time (and two little boys) permits, I'll do my best to answer them.

15 comments

throwaway89201 2080 days ago

Why are many words censored? To take a completely random example (I just took one with many asterisks) where it makes the post completely illegible: https://www.usenetarchives.com/view.php?id=soc.sexuality.gen...

For an archive this is a big no-no. Respect the source material!

Otherwise, thank you for the time spent doing this.

link

jozefjarosciak 2080 days ago

I am running it through a certain set of filters. From my SEO days I recall that search engines often penalize new websites for certain keywords. Considering this is a new site with 300 million plus posts that I am not able to read and moderate, this is the best way I know of to deal with it. But perhaps you're right and I should get rid of it. I'll think about it. This is a valid comment.

link

DanAtC 2080 days ago

Please do not filter anything. A project of this scope is greater than any SEO issues you may have.

link

ajnin 2079 days ago

Since you seem intent on being a reference usenet archive I think it's important to preserve the integrity of the original material. Moderating posts 20 or 30 years after the fact seems ill advised. If you modify the content in any way, at least put a prominent notice so that people don't get confused by the website name.

Also, it seems that your parsing process strips headers and that you don't keep the raw messages. However, I remember that on some newsgroups people used to pass secret messages in headers that only those "in the know" would look for; it would be a shame to lose that. Access to posts in raw format would be nice in this scenario.
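As a rough illustration of the point about headers, here is a minimal Python sketch that splits a raw article into headers and body without dropping anything. The sample article, the `X-Secret` header, and the `split_article` helper are invented for illustration; nothing here reflects how the site actually parses posts.

```python
from email import message_from_string

# Hypothetical raw article. The X-Secret header is made up to stand in for
# the kind of "in the know" header described above.
RAW_ARTICLE = """\
From: alice@example.invalid
Newsgroups: alt.test
Subject: Hello
Message-ID: <1@example.invalid>
X-Secret: meet at the usual place

Body text only.
"""


def split_article(raw: str) -> tuple[list[tuple[str, str]], str]:
    """Return (headers, body), keeping every header in its original order."""
    msg = message_from_string(raw)
    # A list of pairs (not a dict) so repeated header names are not collapsed.
    headers = list(msg.items())
    return headers, msg.get_payload()


headers, body = split_article(RAW_ARTICLE)
for name, value in headers:
    print(f"{name}: {value}")
```

Storing the untouched raw text alongside any parsed form would make it possible to re-parse later if the parser turns out to have missed something.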

link

bigiain 2080 days ago

Maybe rot13 the words you think you need to censor? That'd be in keeping with the usenet tradition at least from the mid-late 90s when I was reading/posting heavily. And maybe add a simple javascript ROT13 widget so people can easily reveal it? (There was a time in my life when I could read ROT13-ed things pretty accurately in my head.)
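A minimal sketch of the ROT13 idea, written in Python rather than as the JavaScript widget suggested above. The `obscure` helper and the word set passed to it are hypothetical; the site's real word list is not known.

```python
import codecs
import re


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; everything else is left alone."""
    return codecs.encode(text, "rot_13")


def obscure(text: str, words: set[str]) -> str:
    """ROT13 only the listed whole words instead of replacing them with asterisks."""
    alternatives = "|".join(re.escape(w) for w in sorted(words))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: rot13(m.group(0)), text)


hidden = obscure("Keeping the pipe full", {"pipe"})
print(hidden)
# Applying ROT13 a second time restores the original text.
print(rot13(rot13("Keeping the pipe full")))
```

Unlike asterisks, this is lossless: a reader (or a reveal button) can recover the original by applying the same function again.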

link

stragulus 2080 days ago

Double rot13 just to be sure.

link

jabl 2080 days ago

It's 2020 and we're under threat from state-level hackers. We need quadruple rot13!

link

bakul 2080 days ago

:-)

link

jozefjarosciak 2079 days ago

I've decided to remove bad word filtering and all other censoring. Let's see how it goes.

link

Thorrez 2080 days ago

One option is to censor by default for SEO, but have some checkbox that sets a cookie that uncensors it.
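One way the checkbox-and-cookie idea could look on the server side, sketched in Python. The cookie name `show_uncensored` and both helper functions are assumptions made for this example, not anything the site is known to use.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "show_uncensored"  # hypothetical cookie name


def wants_uncensored(cookie_header: str) -> bool:
    """True if the request's Cookie header carries the opt-in flag."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(COOKIE_NAME)
    return morsel is not None and morsel.value == "1"


def opt_in_header() -> str:
    """Build the Set-Cookie line sent when the visitor ticks the checkbox."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = "1"
    cookie[COOKIE_NAME]["path"] = "/"
    cookie[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # one year
    return cookie.output(header="Set-Cookie:")


print(wants_uncensored("show_uncensored=1; theme=dark"))
print(wants_uncensored("theme=dark"))
```

Requests without the cookie, which would include most crawlers, would get the filtered view by default.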

link

krsdcbl 2080 days ago

This would be a really cool solution, kind of the reverse of typical SEO bombing techniques that hide spam pages on compromised sites.

link

pmachinery 2080 days ago

You should definitely get rid of whatever is being used currently. The first group I randomly clicked (alt.alien.visitors) was censoring the word "public" (and "sucks" and "pipe"), multiple times in the same post which, if it happens a lot, especially on innocuous words, is really going to spoil what is an excellent project.

It's not a bad idea to filter content though, and/or have a flag button on threads/posts. 300 million articles from 40 years of an obscure and anarchic corner of the internet are bound to contain posts that are either potentially illegal or which you otherwise don't necessarily want to be publishing.

link

jozefjarosciak 2079 days ago

I've removed the filtering.

link

Alex3917 2078 days ago

You can also remove the filter for users, but use site maps to make those posts not visible to search engines.

link

ddingus 2079 days ago

Thanks. Much appreciated. This way we get to experience the colorful humans in full.

link

erikbye 2079 days ago

Are you planning to monetize? If not, then keep SEO out of it.

link

jolmg 2080 days ago

> For an archive this is a big no-no. Respect the source material!

Though I'm also curious, that's perhaps not the tone I would have used when asking. After all, better a censored archive than no archive.

I'm just speculating, but it may be the policy of usenetarchives.com, in order to accept their upload.

Censoring seems to be done around email addresses, names, and offensive words. Perhaps, this is done to reduce the chances of people later asking for the posts to be taken down entirely.

For example, I believe the takedown by Google of comp.lang.lisp and comp.lang.forth commented elsewhere was done because there was offensive content present. The Google support request that mentioned that reason was taken down, but it's what I remember.

link

jozefjarosciak 2079 days ago

I've decided to remove bad word filtering and all other censoring. Let's see how it goes.

link

usefulcat 2080 days ago

For that example, it's at least fairly obvious why it was censored, but this one really puzzles me:

"Getting good FP performance from a micro seems to require pipelining. Keeping the p<asterisk><asterisk>e(s) full seems to require a certain amount of parallelism and regularity."

https://www.usenetarchives.com/view.php?id=comp.arch&g=14965...

Edit: Ah, the irony. HN markdown causes two consecutive asterisks to disappear.

link

dredmorbius 2080 days ago

You can note markup in a code block --- 4 space indent.

    A code block.
    
    Two asterisks follow: **
   
    This normally would be *italicised*.

link

Stratoscope 2079 days ago

It's two spaces. Of course four will work as well, it just adds extra indentation.

link

dredmorbius 2079 days ago

Thanks.

link

jolmg 2080 days ago

> For that example, it's at least fairly obvious why it was censored.

Actually, it's not that obvious. It censored "Dirty Sanchez". I'm thinking it thought it was a person's name, and censored it for privacy reasons?

> Keeping the p<asterisk><asterisk>e(s) full seems to

I suppose "pipe" can have an offensive/sexual connotation. Even if it doesn't so much today, perhaps it did back then.

link

ehsankia 2080 days ago

My guess is that they just used a fairly large word list, which contained a bunch of euphemisms like Dirty Sanchez.

link

jozefjarosciak 2080 days ago

A good guess, that's exactly what I did.
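To illustrate why a plain word list misfires on technical text like the comp.arch example, here is a small Python sketch. The two-word list and the masking style are assumptions; the site's actual list and matching rules were never published in this thread.

```python
import re

WORD_LIST = {"pipe", "sucks"}  # hypothetical stand-in for the real list


def mask(match: re.Match) -> str:
    """Keep the first and last letter, star out the middle."""
    word = match.group(0)
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor_substrings(text: str, words: set[str] = WORD_LIST) -> str:
    """Naive filter: matches list entries anywhere, even inside longer words."""
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.sub(alternatives, mask, text, flags=re.IGNORECASE)


def censor_whole_words(text: str, words: set[str] = WORD_LIST) -> str:
    """Stricter filter: matches list entries only as whole words."""
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.sub(r"\b(?:" + alternatives + r")\b", mask, text, flags=re.IGNORECASE)


sample = "pipelining keeps the pipe full"
print(censor_substrings(sample))
print(censor_whole_words(sample))
```

Whole-word matching spares "pipelining", but the innocent "pipe" is still masked, because a word list carries no information about context.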

link

Jaruzel 2080 days ago

IIRC, 'Dirty Sanchez' is a slang term for a sexual act.

link

glaberficken 2079 days ago

"pipe" means blowjob in french. Maybe the filter dictionary is multilingual?

link

ben_w 2078 days ago

It can happen. A decade ago, Apple’s automated App Store processes warned me that “Knopf” was a dirty word in German.

(It isn’t: while “knob” can be translated as “Knopf”, the latter doesn’t have the anatomical meanings of the former).

link

stevula 2080 days ago

For future historians, the post is about the sex act known as the “dirty sanchez”.

link

Sebb767 2080 days ago

This is not what I expected to learn today on HN

link

IncRnd 2080 days ago

Sigh. Whatever it says about you that the first place you looked was soc.sexuality.general, it is even more saddening that I knew all the words which had been redacted.

link

CGamesPlay 2080 days ago

Not a perfect filter:

> On Oct 13, 11:36 am, "Colin" <Co...@DirtySanchez.b••t> wrote:

link

cakeplease 2080 days ago

It is also homophobic as the word "lesbian" is one of the words censored.

link

hunter2_ 2080 days ago

Some of us appreciate a good mask of asterisks.

link

joeyh 2079 days ago

I could have sworn I did this 9 years ago. :-) http://kitenet.net/~joey/blog/entry/announcing_olduse.net/

Did you find additional utzoo material beyond what was already on archive.org in tarballs?

link

Multicomp 2080 days ago

THANK you for doing this (I said, having never so much as logged in to a usenet thingie)!

I do have one question. Those tapes... how old were they? Were they contemporary to the postings? (Bonus question: if so, wow, but under what justification? Tapes were expensive, right, and nobody valued archiving at the time, except maybe this guy?) Did the individual you got the tapes from meticulously copy them onto new media, with all the overhead that entails?

Today I make personal backups of things I want to keep, whether local files or web snippets, and burn 'em to a DVD, Blu-ray, or optical disk of some kind with the justification of 'can't ransomware WORM media'.

However, I don't do the internet archiving guru stuff of '3 copies, 2 mediums, 1 routine' (or something like that) so in a sense my backups are cruising for a bruising in the case that optical disks could go bad, hard drives could die (these I do migrate to new media when I get around to it, CDs I just read and burn to a fresh one), house could burn down, etc.

Partially looking to see if I can get justification for my backup slovenliness ;)

link

ericbarrett 2080 days ago

To satisfy the "third place" you might look into storing these files in Amazon S3 Glacier, which is about $1/TB/month, so long as you don't read them back. (Retrieval is delayed, batched, and expensive.)

It would take some engineering: I might store each optical disk image as a compressed image file, for example (zstd would be good for the large amount of data), to avoid metadata charges. Fun to think about.

link

CarelessExpert 2080 days ago

I just want to say thank you to you and everyone involved! I ran a little mini project a while ago to archive my old Usenet posts by using puppeteer to scrape the content from Google groups, knowing full well that they could shut it down by fiat any time.

I've frankly never trusted them as a steward for this and I'm glad to see someone stepping in.

Do you have any plans to team up with the Internet Archive on this?

link

jozefjarosciak 2080 days ago

Once I have everything in DB format, I'll certainly attempt to post it there.

link

toomuchtodo 2079 days ago

Please get in touch with collections-service@archive.org (Internet Archive Patron Services) when you’re ready. They will assist you with uploading the corpus to the archive (item creation, collection creation and assignment, metadata hygiene, etc). Thank you for your efforts!

link

codetrotter 2079 days ago

Excellent! Btw, in the DB format the posts are still uncensored or?

link

KyleSanderson 2080 days ago

Nice work my friend. As many demo-scene groups were active on newsgroups before going very dark almost 30 years ago I'm curious if you also have a collection that you can contribute to this basic collection https://archive.org/details/scenenotices . There's lots of this stuff going on throughout the decades but you may be in a position to help preserve history / lineage on the entire environment.

link

jozefjarosciak 2080 days ago

I won't have much when it comes to binaries, I am text mostly, but in terms of text I should have things going that far back.

link

scruffyherder 2079 days ago

Archive.org destroyed their utzoo mirror.

It’s sadly not good enough.

link

epc 2080 days ago

There are weird gaps where groups have some posts from a given date but not all of the posts (e.g. I know I posted to csm.hypercard on 24 April 1990 and can see it in Google's dejanews cache, but not in this cache, which has other posts to the group on that date). Was it just luck of the draw from what was cut to tape?

How do you handle cross posting?

link

jozefjarosciak 2080 days ago

I am still in the process of moving posts to the online DB.

link

lixtra 2079 days ago

How do you deal with the (European) right to be forgotten?

On one hand it’s nice that we can see all this old data. On the other hand, as far as I remember, in the 90s you typically subscribed to a newsgroup and only then got recent posts. So the expectation was a limited lifetime for the post. Of course it would still be stored in thousands of readers.

Obviously the copyright of each post is with the author. You just assume that you have a right to redistribute.

I’m also happy if I get a pointer elsewhere if you have a URL.

link

msla 2079 days ago

> How do you deal with the (European) right to be forgotten?

Probably by not being European.

Europe doesn't make laws for the whole world anymore. Colonialism is dead.

link

lixtra 2079 days ago

What about copyright?

link

SturgeonsLaw 2078 days ago

Archiving is considered fair use

link

cmrdporcupine 2080 days ago

I'm getting this:

Fatal error: Uncaught Error: Call to a member function real_escape_string() on null in /var/www/html/usenetarchives.com/3.vars.php:80 Stack trace: #0 /var/www/html/usenetarchives.com/1.header.php(190): include() #1 /var/www/html/usenetarchives.com/view.php(22): include_once('/var/www/html/u...') #2 {main} thrown in /var/www/html/usenetarchives.com/3.vars.php on line 80

Pretty consistently.

link

jozefjarosciak 2080 days ago

On which pages? Please share some examples. I was making some changes there, so let me know.

link

microtherion 2079 days ago

It's great that your service comes into existence at all, and if you set up an opencollective or patreon account, I'd gladly contribute to the operating costs.

As it exists right now, there are some glaring omissions. I would imagine that showing authors in the search results next to the titles would not be prohibitively difficult.

Proper threading of posts would probably be considerably harder, but would provide immense value IMHO.
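For a sense of what threading involves, here is a simplified Python sketch that links articles by their `References` header. It assumes each article has already been parsed into a dict with `message_id` and `references` keys (both names are hypothetical), and it ignores the harder cases a full threading algorithm has to handle, such as missing or malformed headers and subject-based fallback.

```python
from collections import defaultdict


def build_threads(articles: list[dict]) -> tuple[list[str], dict[str, list[str]]]:
    """Return (root ids, parent id -> child ids) from References chains.

    Each article's 'references' list is ordered oldest first, so the last
    entry is the direct parent. If that parent is not in the archive, fall
    back to the nearest ancestor that is.
    """
    known = {a["message_id"] for a in articles}
    children: dict[str, list[str]] = defaultdict(list)
    roots: list[str] = []
    for article in articles:
        own_id = article["message_id"]
        parent = next(
            (ref for ref in reversed(article["references"])
             if ref in known and ref != own_id),
            None,
        )
        if parent is None:
            roots.append(own_id)  # thread starter, or all ancestors are missing
        else:
            children[parent].append(own_id)
    return roots, dict(children)


sample_articles = [
    {"message_id": "<1>", "references": []},
    {"message_id": "<2>", "references": ["<1>"]},
    {"message_id": "<3>", "references": ["<1>", "<2>"]},
    {"message_id": "<4>", "references": ["<0>"]},  # parent not archived
]
roots, children = build_threads(sample_articles)
print(roots, children)
```

With gaps in the archive, as reported elsewhere in this thread, many replies would surface as roots, which is part of why threading is harder than it looks.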

link

scruffyherder 2079 days ago

Any chance at getting access to all the other raw data? The UTZOO archives have been invaluable, and the only reason they survive today is because they were mirrored.

DMCA took this incredible resource off line. Please don’t let this happen again, as even archive.org found it easier to destroy it all than to fight Marty.

Plus it allows all us little people to build fun things with it, like my altavista based search of utzoo:

altavista.superglobalmegacorp.com

Thanks

link

gravitas 2080 days ago

Thank you! Is 2003 a hard limit to how far back was captured, or just a current point in the import? I was looking to comp.os.linux.* groups from a bit earlier than that, a lot of the 90s. Thanks again :)

link

thimkerbell 2080 days ago

People, once you get to a newsgroup, click on 'all years' to see which years are in the archive, and select which year you want. Don't expect posts from last millennium to be present, at least not yet.

link

gravitas 2079 days ago

At the time of my comment, 2003 was the oldest year present for the group(s) I mentioned. I searched for a FAQ before posting, and it's not clearly marked "imports are still in progress". Please don't be condescending; OP specifically stated "ask me questions." You are not OP.

Edit: I am now on a desktop, and confirm it's not a mobile glitch; the archives for comp.os.linux are imported back to 2003 only at this time. https://www.usenetarchives.com/threads.php?id=comp.os.linux&...

link

davidwritesbugs 2079 days ago

Will you be making this available on NNTP servers? It might be nice to see it on a text-only NNTP server. Surely one of the big Usenet providers would want to help; it seems to be in their interest.

link

cerebrum 2079 days ago

Do you know where I can find an archive with the de.* (German) groups?

link

zeveb 2079 days ago

Do you have any concerns about the so-called 'right' to be forgotten? Are you concerned that anyone who was active back then might lose a job now due to his postings a couple of decades ago?

link

CamperBob2 2080 days ago

Agree with throwaway89201. Don't do that. Reproduce the text as you found it.

link