HN Mirror


	by martinkallstrom 4735 days ago
	My former startup Twingly (http://twingly.com) has hundreds of millions of blog posts stored (everything collected since 2006) in 128 MySQL shards with a unified query interface. The last few months of data are indexed and searchable for free from their website, but the entire archive is kept forever.

2 comments

ersii 4735 days ago

That's great.

However, ArchiveTeam has uploaded all data that they've found (at least 46.23M feeds) to the Internet Archive. That means it's public for everyone to mine through and/or use.

I'm not trying to belittle Twingly here - but their "last few months of data" are maybe not really comparable to completely free and public data - kept forever.

porker 4735 days ago

Would you donate your data to the Internet Archive?

martinkallstrom 4735 days ago

Perhaps there could be some continuous rollover with all data older than five years being made available through the Internet Archive. I'm no longer affiliated with Twingly but of course know them very well. I can make a proposal! It would be a great idea and I guess for Twingly it could mean increased brand recognition.

Aloisius 4735 days ago

Or Common Crawl so other people could actually download and use it?

porker 4734 days ago

I didn't realise you couldn't download and use the data from Internet Archive. If not, that's pretty silly to back up the feeds to them, and I'm a bit annoyed to have contributed. I'd like to make them available to everyone to download, analyse, plug into their reader etc etc etc...

espes 4734 days ago

http://archive.org/search.php?query=collection%3Aarchiveteam...

zeckalpha 4734 days ago

You can from the Internet Archive. The GGP is talking about Twingly, and the discussion is about integrating their data with the Archive Team.

Aloisius 4734 days ago

For anything substantial (like, say, their actual crawl), they'll only do it on a case-by-case basis with a rather restrictive license, and you have to drive up there and plop down the machines to copy it onto.
