Newsgroups and the Internet Archive: I Made a Difference | HN Mirror

	Newsgroups and the Internet Archive: I Made a Difference (screengod.blogspot.com)
	162 points by mailxplorer 3008 days ago

13 comments

wmf 3008 days ago

It's pretty sad that a bunch of people sent historical Usenet archives to Google, they imported them into Google Groups, and then... basically hid/lost access to it over time. I assume the data is in GFS somewhere but it may never see the light of day. But maybe we just need to shame them every decade: https://www.wired.com/2009/10/usenet/

Steuard 3008 days ago

Google Groups was great when they first acquired all those archives! It just got progressively worse and worse over time, and I've never understood why. (Stagnation I could understand, but how did the same searches get less effective?)

erik_seaberg 3008 days ago

Google has a continual Red Queen's race between everyone's dependencies. I'm guessing they were lucky to keep it mostly working and avoid Reader's fate.

"We're deprecating X, please migrate to Y by this date."

"Y doesn't support all our use cases!"

"Neither will X when we turn it off, sorry."

ghaff 3008 days ago

I get that Google essentially walked back from an at least implied role as an information archivist. But I still don't really get why they so completely abandoned things like Google Groups given the truly minuscule resources to maintain them at some low level.

cultureleak_ta0 3008 days ago

Nobody can get promoted for improvements like that. It would only garner goodwill with a tiny customer base and it’s not clear how that would translate to revenue or user growth on other products

petercooper 3008 days ago

Totally agree, but I kinda feel (and this is solely my opinion) that much as it's our collective responsibility to donate to worthwhile causes, gigantic tech companies should spend rounding-error money on good digital causes like this.

wmf 3008 days ago

But there do seem to be quite a few people at Google already working on random stuff that won't get them promoted.

digi_owl 3008 days ago

What seems to have happened was that Google shifted focus from sorting the web to delivering the "zeitgeist" of the present.

Perhaps because more people were likely to search for whatever was on the news cycle than some arbitrary piece of info.

Never mind that Google's earnings come from ads, not search results.

ghaff 3008 days ago

Google's original mission statement from 1998 was to “organize the world’s information and make it universally accessible and useful.” Things like Usenet archives are clearly part of that.

That is still apparently their mission statement but I agree the reality seems more along the lines of "deliver the best search results for users' immediate needs" [while making as much money from that as humanly possible]

macspoofing 3008 days ago

It's sad. The original mission implied nothing was going to get lost and abandoned. So much for that.

macspoofing 3008 days ago

Somebody (and most likely a team of somebodies) needs to own it. If nobody at the top is willing to champion the project and carve out a budget and people, then they need to fold it.

giancarlostoro 3008 days ago

I wish Google would donate those logs to archive.org and maybe even some servers. Idk why some of these tech giants never donate to projects that care about the internet, the very thing which drives their profits.

naasking 3008 days ago

Google search in general has become less effective, so perhaps it simply carried over.

mmanfrin 3008 days ago

Search is abysmal these days. I'll look for a term with 3 words and it almost always throws out one word, giving me useless results.

nikanj 3008 days ago

“You searched for Stockholm Syndrome Research. Here’s a bunch of Stockholm travel blogs! They’re just missing two of your keywords, but they’re really popular with other visitors”

IAmEveryone 3008 days ago

I just did this search, and at least the first page only had results perfectly matching the query.

Try yourself: https://www.google.com/search?q=Stockholm+Syndrome+Research

ccozan 3008 days ago

I think it's possible to have the verbatim option active all the time, but I don't remember how.

pasbesoin 3008 days ago

And the people who simply accept this have...

xvf22 3008 days ago

Quotes still force the word to be included, don't they? Though I suppose that still doesn't do much if the algorithm has tossed out the matches you want.

JoshMnem 3008 days ago

Google Groups was always terrible. It almost immediately started ruining Usenet.

NelsonMinar 3008 days ago

Google's Usenet archive is too important to be owned by one company. My understanding is some Google engineers worked a few years ago to make sure Archive had a copy, but I can't verify that and I don't think it's online. Some of the earliest archives come from Eugene Spafford's collection which is readily online elsewhere, but Google did a lot of work cleaning it and of course has the DejaNews archives which are invaluable.

I'm amazed Google Groups still exists as a product; it seems abandoned internally and I expect to hear it's shut down any month now.

luckydude 3008 days ago

I miss dejanews - anyone remember that before Google bought them? They had full search, it's a shame Google let that go. Their search went back to pretty much the beginning of net news. Does that exist anywhere anymore?

NelsonMinar 3008 days ago

At least some of the Usenet archives are still online in Google Groups and searchable. Can't vouch for completeness of the search index though, it looks pretty wonky for stuff from the early 1990s. Anyway, example working archive link I found via search: https://groups.google.com/forum/#!searchin/comp.os.minix/tor...

Don't be too hard on Google's acquisition of Deja. I wasn't working at Google at the time, but heard from many colleagues when I joined that the Deja acquisition was quite chaotic because Deja was just about to shut down entirely when Google picked it up rather than let it disappear. It's a shame they don't do better with the Usenet archive now, but it's clearly not Google's business anymore.

squarefoot 3008 days ago

Dejanews was simply fantastic. When Google embraced and extinguished it and then later killed the discussion filter from its front page I finally realized the Internet as we knew it until the late 90s/early 2K was dead. It transitioned from an instrument I could use to find people exchanging genuine opinions on stuff to a way to inundate my search result with biased people promoting or selling that stuff.

oasisbob 3008 days ago

I remember dejanews, but don't remember liking the experience. I remember being very excited when Google acquired them.

I remember being super bummed when bitrot set in. Posts I knew existed just couldn't be found.

As someone who got their first car running by way of Usenet archives (a 69 Beetle), I always loved how well it was archived compared to all the web bulletin boards that fragment so much information as they quickly rot year after year.

sverige 3008 days ago

For all of you who are misty-eyed pondering those wonderful, probably-lost-forever Usenet posts from the 90s, I dare you to read Kibo's .signature (last updated 5/5/94 4:52AM <-- CINCO DE MAY-O !!!!) at http://archive.birdhouse.org/etc/kibosig.txt to cure yourselves of misplaced misty-eyed nostalgia for those long-ago times when the Internet was something else.

Kibo for President!

edraferi 3008 days ago

What did I just read. It’s like trying to understand satire from a hundred years ago.

I was convinced it was hopelessly corrupted by the archiving process, but some of the ASCII art still works (mostly).

aperrien 3007 days ago

It's concentrated 90's memes, straight from the Gen X tap. Can be hazardous in large doses.

jstarfish 3007 days ago

It's a usenet signature that grew longer and more ridiculous over time, with the changelog slapped on top.

It really is archaic; we don't use signatures for anything anymore (email, barely).

llao 3008 days ago

Thank you!

If I read it right, humanity has lost 1991-2003 to Google though, correct?

linguaz 3008 days ago

Take a look here:

https://archive.org/details/usenethistorical

"This historical collection of Usenet spans more than 30 years and was given to us by a generous donor"

This group for example, from that collection:

    https://archive.org/download/usenet-comp/comp.emacs.mbox.zip
    [69.8M]

includes posts spanning from December 1988 to June 2013.

For some reason the mbox files have an odd format, with From lines that look like:

     From -8118066241627336028

I wrote some scripts to fix that so I could open the mbox files in Mutt.
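Those scripts aren't posted here, so the following is only a rough sketch of the idea, not the actual fix. It assumes the odd separators are always "From " followed by a signed integer, and it swaps each one for a conventional "From sender date" line. The names `fix_from_lines`, `ODD_FROM`, and `PLACEHOLDER`, and the dummy sender and epoch date, are made up for illustration.

```python
import re
import time

# Assumption: the odd separator lines are "From " plus a signed integer and nothing else.
ODD_FROM = re.compile(r"^From -?\d+\s*$")

# Hypothetical placeholder in the usual "From sender date" shape (asctime-style date).
PLACEHOLDER = "From MAILER-DAEMON " + time.asctime(time.gmtime(0))


def fix_from_lines(lines):
    """Yield lines unchanged, except odd separators, which become PLACEHOLDER."""
    for line in lines:
        if ODD_FROM.match(line.rstrip("\r\n")):
            yield PLACEHOLDER + "\n"
        else:
            yield line


if __name__ == "__main__":
    sample = ["From -8118066241627336028\n", "Subject: hello\n", "\n", "body text\n"]
    print("".join(fix_from_lines(sample)), end="")
```

To convert a real file you would stream it line by line through `fix_from_lines` and write the result to a new mbox; the real message date could be pulled from each article's Date header instead of the dummy one used here.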

BTW, found the above links on this page:

http://ryanfb.github.io/etc/2015/02/23/early_usenet_history_...

which has more info & links about historical usenet archives.

llao 3008 days ago

Oh wow, does this mean that we have all of usenet texts? What stops people from providing a better interface than Google's then?

zokier 3008 days ago

> After a month I had most newsgroups - excepting binaries - and it came to 800GB.

> Trying to index THAT lot was impossible

Stupid question, but why would indexing 800GB of newsgroup postings be impossible?

mailxplorer 3008 days ago

Clucene was way too slow for body text, more than 1GB. I had my own header parser in C++ (though you can do that in Python easily).
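For anyone curious what the Python version might look like, here is a minimal sketch using the standard library's `email.parser.HeaderParser`. It is not the C++ parser mentioned above; the function name `parse_headers` and the choice of fields are assumptions for illustration.

```python
from email.parser import HeaderParser


def parse_headers(raw_article):
    """Return a few common headers from one raw article (headers, blank line, body)."""
    msg = HeaderParser().parsestr(raw_article)  # parses headers, leaves the body unparsed
    newsgroups = msg.get("Newsgroups") or ""
    return {
        "message_id": msg.get("Message-ID"),
        "newsgroups": [g.strip() for g in newsgroups.split(",") if g.strip()],
        "subject": msg.get("Subject"),
        "from": msg.get("From"),
        "date": msg.get("Date"),
    }


if __name__ == "__main__":
    sample = (
        "From: a@example.com\n"
        "Newsgroups: comp.emacs, comp.lang.c\n"
        "Subject: test\n"
        "Message-ID: <1@example.com>\n"
        "\n"
        "body\n"
    )
    print(parse_headers(sample))
```

Missing headers come back as `None`, so a real indexer would need to decide how to handle articles without a Date or Message-ID.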

I'm trying again on that 800GB with KISS DB (append-only hashtable), and Elasticsearch. Doesn't matter if GPL because it's a website.

GuacheSuedeHN 3008 days ago

Do you mind sharing the code? I think that is an interesting thing to see

jl6 3008 days ago

Reddit is the current era’s equivalent of Usenet, and we don’t have a robust archive of that either.

brokensegue 3007 days ago

wayback machine's archive of reddit isn't perfect but it works. just give the IA more money

Asparagirl 3007 days ago

Yes, but much of the Wayback Machine’s reddit content was specifically targeted and scraped by ArchiveTeam, who are volunteers that seek out at-risk content from the web and make sure that it gets into the Wayback. In the past few years we’ve specifically tried to go after sub-reddits that we thought were newsworthy and/or at high risk for deletion. But there’s no way we can get all of it.

But you can help! If you have extra server space/bandwidth or you can spare $40/month, we can add more pipelines: https://www.archiveteam.org/index.php/ArchiveBot

Source: am ArchiveTeam member, run various pipelines, have scraped sub-reddits ranging from The_Donald to the cryptocurrency worlds to darknet markets.

Steuard 3008 days ago

It's great to see some Usenet archives out there to partly make up for the disappointment of Google Groups. But I'm sad that this archive seems to be incomplete, even within its stated date range. Back in the day, I was active on rec.arts.books.tolkien and alt.fan.tolkien: in this archive, I can't find any trace of the massive "alt." hierarchy at all, and the list of files for the "rec." hierarchy doesn't include the Tolkien group. For that matter, the list includes rec.humor.funny and rec.humor.d and others, but apparently not rec.humor itself. (It really does make you appreciate just how substantial the effort of collecting a comprehensive Usenet archive would be.)

On another note, not that anyone here would be able to fix it, but this list would be a lot easier to search through if the item names didn't all begin with "Usenet newsgroups within", so you could jump to first letters in a meaningful way.

mailxplorer 3008 days ago

It's my fault (not the IA), I thought I'd got all newsgroups but must have missed some. I just checked the main newsgroup list and it's incomplete, for some reason.

My plan though, is to dust off the old code, get a complete list of groups, get them, and then make it searchable.

Sorry about that, I didn't check. This was all done in 2013. Basically, I wanted to build a search engine but indexing the newsgroup posts (for header and text body search) would take too long. I abandoned it, then in 2016 I sent it to the IA.

So, I only just found out it's lacking a bunch of groups, 5 years on...

OSS542 3005 days ago

This is much appreciated. :-) I should note that the comp.sys.amiga.* newsgroups were first established on Jan 8, 1991. The first posts in the mbox archive for comp.sys.amiga.programmer, however, are from May 31, 1994. It looks like the first three years might be missing, at least for this group. I haven't checked the other amiga groups yet.

Steuard 3008 days ago

No apologies necessary: creating the archive in the first place was awesome! Like I said, this just shows how massive and complex Usenet is (or maybe, was), and how it's not easy at all to create a comprehensive archive. It's far better to have some of it than none of it!

mailxplorer 3007 days ago

From what I can gather, I downloaded one tenth of Usenet - 11,000 groups. This means if I'd done all 110,000 groups it would have taken me about a year to download them, and an 8TB drive to store them (in 2013!). That wasn't really feasible...
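The back-of-the-envelope numbers can be checked like this. It is only a sketch: it assumes size and download time scale linearly with the number of groups, and it takes the 800GB and one-month figures from the article as quoted elsewhere in this thread. The function name `estimate_full_download` is made up.

```python
def estimate_full_download(groups_done, groups_total, size_gb, months):
    """Scale observed size and time linearly to the full group count (an assumption)."""
    scale = groups_total / groups_done
    return size_gb * scale / 1000, months * scale  # (terabytes, months)


if __name__ == "__main__":
    est_tb, est_months = estimate_full_download(11_000, 110_000, 800, 1)
    print(f"about {est_tb:.0f} TB and {est_months:.0f} months")
```

Ten times the groups gives ten times the data and ten times the wait, which is where the 8TB figure comes from and why the download would run to roughly a year.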

bhaak 3008 days ago

It looks like the downloads are grouped by top level group.

So you can go to https://archive.org/download/gna-rec and see all archived groups that are in the rec hierarchy.

Steuard 3008 days ago

Right: As I said earlier, the list in this archive includes rec.humor.funny but not rec.arts.books.tolkien. (And I seem to recall that the alt.* hierarchy was a little less broadly propagated, which might possibly be related to why it wasn't included here.)

astrodust 3008 days ago

It looks like alt is missing, sadly.

bhaak 3008 days ago

There are also groups within an existing hierarchy missing. For example, rec.games.roguelike.development is there but rec.games.roguelike.nethack is missing.

fencepost 3008 days ago

I'm impressed that Giganews maintains a 10-year archive of Usenet; I suspect that would break most newsreaders.

u801e 3008 days ago

A lot of the big usenet providers have at least a decade's worth of article retention at this point (even for binary newsgroups).

xor1 3008 days ago

Usenet was my very first exposure to internet communities. I read and posted on alt.games.nintendo.pokemon between ages 9 and 13, then moved to Something Awful after that.

alxlaz 3008 days ago

I'm gonna put my dinosaur hat on and remind everyone that you can still use USENET in 2018. It's alive, well, and if you don't need access to binary groups, it's also free and very straightforward.

There are still some interesting discussions going on, mostly in the technical groups.

I still open it up maybe once a month or so for nostalgia, though I haven't posted in a while.

l1n 3008 days ago

Search query to the void, but if anyone has archives of the umbc.* hierarchy, I'd be eternally grateful to see them.

forapurpose 3008 days ago

Usenet articles periodically get promoted to the front page. I wonder what that says about the age demographics of HN, given that Usenet hasn't been significant in maybe 20 years (and even then it was a niche). Is the younger generation here? And if not here, where?

unimpressive 3008 days ago

I'm kind of offended that you seem to think young people can't be aware of things that happened before their time.

EDIT: To elaborate a bit, my expectation would be that the sort of young person using HN leans more historically interested than normal, and is more likely to appreciate the value of things like the Internet Archive. Usenet is a huge part of digital history, and the fact that it's not available even though archives were kept is something of a tragedy.

edraferi 3008 days ago

I’m too young for Usenet, but I think it’s an interesting bit of history. I like how open it seems relative to current social media.