Hacker News
Newsgroups and the Internet Archive: I Made a Difference (screengod.blogspot.com)
162 points by mailxplorer 3008 days ago
13 comments

It's pretty sad that a bunch of people sent historical Usenet archives to Google, they imported them into Google Groups, and then... basically hid/lost access to it over time. I assume the data is in GFS somewhere but it may never see the light of day. But maybe we just need to shame them every decade: https://www.wired.com/2009/10/usenet/
Google Groups was great when they first acquired all those archives! It just got progressively worse and worse over time, and I've never understood why. (Stagnation I could understand, but how did the same searches get less effective?)
Google has a continual Red Queen's race between everyone's dependencies. I'm guessing they were lucky to keep it mostly working and avoid Reader's fate.

"We're deprecating X, please migrate to Y by this date."

"Y doesn't support all our use cases!"

"Neither will X when we turn it off, sorry."

I get that Google essentially walked back from an at least implied role as an information archivist. But I still don't really get why they so completely abandoned things like Google Groups given the truly minuscule resources to maintain them at some low level.
Nobody can get promoted for improvements like that. It would only garner goodwill with a tiny customer base and it’s not clear how that would translate to revenue or user growth on other products
Totally agree, but I kinda feel (and this is solely my opinion) that much as it's our collective responsibility to donate to worthwhile causes, gigantic tech companies should spend rounding-error money on good digital causes like this.
But there do seem to be quite a few people at Google already working on random stuff that won't get them promoted.
What seems to have happened was that Google shifted focus from sorting the web to delivering the "zeitgeist" of the present.

Perhaps because more people were likely to search for whatever was on the news cycle than some arbitrary piece of info.

Never mind that Google's earnings come from ads, not search results.

Google's original mission statement from 1998 was to “organize the world’s information and make it universally accessible and useful.” Things like Usenet archives are clearly part of that.

That is still apparently their mission statement but I agree the reality seems more along the lines of "deliver the best search results for users' immediate needs" [while making as much money from that as humanly possible]

It's sad. The original mission implied nothing was going to get lost and abandoned. So much for that.
Somebody (and most likely a team of somebodies) needs to own it. If nobody at the top is willing to champion the project and carve out a budget and people, then they need to fold it.
I wish Google would donate those logs to archive.org and maybe even some servers. Idk why some of these tech giants never donate to projects that care about the internet, the very thing which drives their profits.
Google search in general has become less effective, so perhaps it simply carried over.
Search is abysmal these days. I'll look for a term with 3 words and it almost always throws out one word, giving me useless results.
“You searched for Stockholm Syndrome Research. Here’s a bunch of Stockholm travel blogs! They’re just missing two of your keywords, but they’re really popular with other visitors”
I just did this search, and at least the first page only had results perfectly matching the query.

Try yourself: https://www.google.com/search?q=Stockholm+Syndrome+Research

I think it is possible to have the verbatim option active all the time, but I don't remember how.
And the people who simply accept this have...
Quotes still force the word to be included don't they? Though I suppose that still doesn't do much if the algorithm has tossed out the matches you want.
Google Groups was always terrible. It almost immediately started ruining Usenet.
Google's Usenet archive is too important to be owned by one company. My understanding is some Google engineers worked a few years ago to make sure Archive had a copy, but I can't verify that and I don't think it's online. Some of the earliest archives come from Eugene Spafford's collection which is readily online elsewhere, but Google did a lot of work cleaning it and of course has the DejaNews archives which are invaluable.

I'm amazed Google Groups still exists as a product, it seems abandoned internally and I expect to hear it's shut down any month now.

I miss dejanews - anyone remember that before Google bought them? They had full search, it's a shame Google let that go. Their search went back to pretty much the beginning of net news. Does that exist anywhere anymore?
At least some of the Usenet archives are still online in Google Groups and searchable. Can't vouch for completeness of the search index though, it looks pretty wonky for stuff from the early 1990s. Anyway, example working archive link I found via search: https://groups.google.com/forum/#!searchin/comp.os.minix/tor...

Don't be too hard on Google's acquisition of Deja. I wasn't working at Google at the time, but heard from many colleagues when I joined that the Deja acquisition was quite chaotic because Deja was just about to shut down entirely when Google picked it up rather than let it disappear. It's a shame they don't do better with the Usenet archive now, but it's clearly not Google's business anymore.

Dejanews was simply fantastic. When Google embraced and extinguished it and then later killed the discussion filter from its front page I finally realized the Internet as we knew it until the late 90s/early 2K was dead. It transitioned from an instrument I could use to find people exchanging genuine opinions on stuff to a way to inundate my search result with biased people promoting or selling that stuff.
I remember dejanews, but don't remember liking the experience. I remember being very excited when Google acquired them.

I remember being super bummed when bitrot set in. Posts I knew existed just couldn't be found.

As someone who got their first car running by way of Usenet archives (a 69 Beetle), I always loved how well it was archived compared to all the web bulletin boards that fragment so much information as they quickly rot year after year.

For all of you who are misty-eyed pondering those wonderful, probably-lost-forever Usenet posts from the 90s, I dare you to read Kibo's .signature (last updated 5/5/94 4:52AM <-- CINCO DE MAY-O !!!!) at http://archive.birdhouse.org/etc/kibosig.txt to cure yourselves of misplaced misty-eyed nostalgia for those long-ago times when the Internet was something else.

Kibo for President!

What did I just read. It’s like trying to understand satire from a hundred years ago.

I was convinced it was hopelessly corrupted by the archiving process, but some of the ASCII art still works (mostly).

It's concentrated 90's memes, straight from the Gen X tap. Can be hazardous in large doses.
It's a usenet signature that grew longer and more ridiculous over time, with the changelog slapped on top.

It really is archaic; we don't use signatures for anything anymore (email, barely).

Thank you!

If I read it right, humanity has lost 1991-2003 to Google though, correct?

Take a look here:

https://archive.org/details/usenethistorical

"This historical collection of Usenet spans more than 30 years and was given to us by a generous donor"

This group for example, from that collection:

    https://archive.org/download/usenet-comp/comp.emacs.mbox.zip
    [69.8M]
includes posts spanning from December 1988 to June 2013.

For some reason the mbox files have an odd format, with From lines that look like:

     From -8118066241627336028
I wrote some scripts to fix that so I could open the mbox files in Mutt.
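For anyone hitting the same problem, here is a minimal sketch of one way such a fix could look. It is not the script mentioned above, which wasn't posted. It assumes every separator in these files looks like `From ` followed by an optionally negative integer, and it swaps in a conventional placeholder separator with a dummy sender and date. The names `ODD_SEPARATOR`, `PLACEHOLDER` and `fix_separators` are made up for illustration.

```python
import re

# Assumption: separators in these archives are "From " plus an
# optionally negative integer and nothing else on the line.
ODD_SEPARATOR = re.compile(r"^From -?\d+\s*$")

# Conventional mbox separator shape: "From <sender> <asctime date>".
# The sender and date here are dummies, not recovered from the message.
PLACEHOLDER = "From MAILER-DAEMON Thu Jan  1 00:00:00 1970\n"


def fix_separators(lines):
    """Yield lines unchanged, except odd numeric separators are replaced."""
    for line in lines:
        if ODD_SEPARATOR.match(line):
            yield PLACEHOLDER
        else:
            yield line


if __name__ == "__main__":
    sample = [
        "From -8118066241627336028\n",
        "Subject: example\n",
        "\n",
        "From here on this is body text\n",
    ]
    print("".join(fix_separators(sample)), end="")
```

One caveat: a body line that happens to consist only of `From ` and a number would be rewritten too, so it is worth spot-checking the output before relying on it.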

BTW, found the above links on this page:

http://ryanfb.github.io/etc/2015/02/23/early_usenet_history_...

which has more info & links about historical usenet archives.

Oh wow, does this mean that we have all of usenet texts? What stops people from providing a better interface than Google's then?
> After a month I had most newsgroups - excepting binaries - and it came to 800GB.

> Trying to index THAT lot was impossible

Stupid question, but why would indexing 800GB of newsgroup postings be impossible?

Clucene was way too slow for body text, more than 1GB. I had my own header parser in C++ (though you can do that in Python easily).
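As a rough illustration of the "you can do that in Python easily" part, the standard library's `email.parser.HeaderParser` reads only the header block of an article and skips the body. This is a sketch, not the C++ parser mentioned above; the `parse_headers` name and the choice of fields are assumptions.

```python
from email.parser import HeaderParser

# Fields picked for illustration; a real index might want others.
FIELDS = ("From", "Newsgroups", "Subject", "Date", "Message-ID")


def parse_headers(raw_article: str) -> dict:
    """Return the selected headers of one article; missing ones become ''."""
    message = HeaderParser().parsestr(raw_article)
    return {name: message.get(name, "") for name in FIELDS}


if __name__ == "__main__":
    article = (
        "From: someone@example.com\n"
        "Newsgroups: comp.emacs\n"
        "Subject: example post\n"
        "\n"
        "Body text is ignored by the header parser.\n"
    )
    print(parse_headers(article))
```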

I'm trying again on that 800GB with KISS DB (append-only hashtable), and Elasticsearch. Doesn't matter if GPL because it's a website.

Do you mind sharing the code? I think that would be an interesting thing to see.
Reddit is the current era’s equivalent of Usenet, and we don’t have a robust archive of that either.
wayback machine's archive of reddit isn't perfect but it works. just give the IA more money
Yes, but much of the Wayback Machine’s reddit content was specifically targeted and scraped by ArchiveTeam, who are volunteers that seek out at-risk content from the web and make sure that it gets into the Wayback. In the past few years we’ve specifically tried to go after sub-reddits that we thought were newsworthy and/or at high risk for deletion. But there’s no way we can get all of it.

But you can help! If you have extra server space/bandwidth or you can spare $40/month, we can add more pipelines: https://www.archiveteam.org/index.php/ArchiveBot

Source: am ArchiveTeam member, run various pipelines, have scraped sub-reddits ranging from The_Donald to the cryptocurrency worlds to darknet markets.

It's great to see some Usenet archives out there to partly make up for the disappointment of Google Groups. But I'm sad that this archive seems to be incomplete, even within its stated date range. Back in the day, I was active on rec.arts.books.tolkien and alt.fan.tolkien: in this archive, I can't find any trace of the massive "alt." hierarchy at all, and the list of files for the "rec." hierarchy doesn't include the Tolkien group. For that matter, the list includes rec.humor.funny and rec.humor.d and others, but apparently not rec.humor itself. (It really does make you appreciate just how substantial the effort of collecting a comprehensive Usenet archive would be.)

On another note, not that anyone here would be able to fix it, but this list would be a lot easier to search through if the item names didn't all begin with "Usenet newsgroups within", so you could jump to first letters in a meaningful way.

It's my fault (not the IA), I thought I'd got all newsgroups but must have missed some. I just checked the main newsgroup list and it's incomplete, for some reason.

My plan though, is to dust off the old code, get a complete list of groups, get them, and then make it searchable.

Sorry about that, I didn't check. This was all done in 2013. Basically, I wanted to build a search engine but indexing the newsgroup posts (for header and text body search) would take too long. I abandoned it, then in 2016 I sent it to the IA.

So, I only just found out it's lacking a bunch of groups, 5 years on...

This is much appreciated. :-) I should note that the comp.sys.amiga.* newsgroups were first established on Jan 8, 1991. The first posts in the mbox archive for comp.sys.amiga.programmer, however, are from May 31, 1994. It looks like the first three years might be missing, at least for this group. I haven't checked the other Amiga groups yet.
No apologies necessary: creating the archive in the first place was awesome! Like I said, this just shows how massive and complex Usenet is (or maybe, was), and how it's not easy at all to create a comprehensive archive. It's far better to have some of it than none of it!
From what I can gather, I downloaded one tenth of Usenet - 11,000 groups. This means if I'd done all 110,000 groups it would have taken me about a year to download them, and an 8TB drive to store them (in 2013!). That wasn't really feasible...
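The estimate can be checked with simple scaling. This sketch assumes the figures quoted earlier in the thread (about a month of downloading and 800GB) describe the same 11,000 groups, and that time and size grow linearly with the number of groups. Both are assumptions.

```python
# Figures from the thread; linear scaling is an assumption.
downloaded_groups = 11_000
total_groups = 110_000
downloaded_gb = 800   # size of what was actually fetched
months_spent = 1      # roughly how long that took

scale = total_groups / downloaded_groups       # how many times more groups
estimated_tb = downloaded_gb * scale / 1000    # decimal terabytes
estimated_months = months_spent * scale

print(f"scale: {scale:.0f}x, storage: {estimated_tb:.0f} TB, "
      f"time: {estimated_months:.0f} months")
```

Ten months rounds to "about a year", and the storage matches the 8TB drive.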
It looks like the downloads are grouped by top level group.

So you can go to https://archive.org/download/gna-rec and see all archived groups that are in the rec hierarchy.

Right: As I said earlier, the list in this archive includes rec.humor.funny but not rec.arts.books.tolkien. (And I seem to recall that the alt.* hierarchy was a little less broadly propagated, which might possibly be related to why it wasn't included here.)
It looks like alt is missing, sadly.
There are also groups missing from within existing hierarchies. For example, rec.games.roguelike.development is there but rec.games.roguelike.nethack is missing.
I'm impressed that Giganews maintains a 10-year archive of Usenet; I suspect that would break most newsreaders.
A lot of the big usenet providers have at least a decade's worth of article retention at this point (even for binary newsgroups).
Usenet was my very first exposure to internet communities. I read and posted on alt.games.nintendo.pokemon between ages 9 and 13, then moved to Something Awful after that.
I'm gonna put my dinosaur hat on and remind everyone that you can still use USENET in 2018. It's alive, well, and if you don't need access to binary groups, it's also free and very straightforward.

There are still some interesting discussions going on, mostly in the technical groups.

I still open it up maybe once a month or so for nostalgia, though I haven't posted in a while.

Search query to the void, but if anyone has archives of the umbc.* hierarchy, I'd be eternally grateful to see them.
Usenet articles periodically get promoted to the front page. I wonder what that says about the age demographics of HN, given that Usenet hasn't been significant in maybe 20 years (and even then it was a niche). Is the younger generation here? And if not here, where?
I'm kind of offended that you seem to think young people can't be aware of things that happened before their time.

EDIT: To elaborate a bit, my expectation would be that the sort of young person using HN leans more historically interested than normal, and is more likely to appreciate the value of things like the Internet Archive. Usenet is a huge part of digital history, and the fact that it's not available even though archives were kept is something of a tragedy.

I’m too young for Usenet, but I think it’s an interesting bit of history. I like how open it seems relative to current social media.