| HN Mirror

Hacker News


	by ck2 4504 days ago
	Putting aside legal issues, you don't have any moral problems with scraping volumes of content that is not yours?

3 comments

nathancahill 4504 days ago

Certainly not! If I were re-selling the data, maybe. But I'm generally using it for statistics and data viz. I include the source of the data and I always obey robots.txt. Sometimes I'm even able to talk with the owners beforehand to get their OK.

(Don't downvote him, it's a valid question)
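
For readers wondering what "obeying robots.txt" can look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the commenter's code: the helper `is_allowed`, the user agent name `example-scraper`, and the sample robots.txt text are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.com/stats.html",
        "https://example.com/private/data.csv",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS, "example-scraper", url))
```

A real scraper would more likely fetch the live file with `set_url()` and `read()` and run this check before every request; the string-based version above just keeps the example self-contained.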

nl 4503 days ago

Now you'll have to tell us about the project...

300TB is quite a lot, even today.

CamperBob2 4504 days ago

Over time, I've learned to wget every web page and content archive I want to keep. The Internet forgets.

reeses 4503 days ago

In an earlier age, I ran everything through squid to consolidate browser caches. About five minutes after setting it up, I realised that pulling all the references in the log file and then indexing the lot with htdig would be tremendously useful when I was on the road without internet access.

I spent way too much time pruning stupid crap such as Slashdot and started to learn this 'Bayesian classifier' thing.

Your idea is much better.
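
The "pull all the references out of the log file" step might be sketched roughly as follows. This is an assumption-laden illustration, not the commenter's setup: it assumes Squid's default native access.log layout, where whitespace-separated fields put the request method sixth and the URL seventh, and the function name `extract_urls` and the sample log lines are made up. A custom `logformat` would need different field positions.

```python
from typing import Iterable, List

# Field positions assume Squid's default native access.log layout:
# time elapsed client action/code bytes method URL ident hierarchy/peer type
METHOD_FIELD = 5
URL_FIELD = 6


def extract_urls(log_lines: Iterable[str]) -> List[str]:
    """Return the unique URLs of GET requests, in first-seen order."""
    seen = set()
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) <= URL_FIELD:
            continue  # skip blank or truncated lines
        if fields[METHOD_FIELD] != "GET":
            continue  # CONNECT, POST, etc. are not useful to re-fetch
        url = fields[URL_FIELD]
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up log lines, for illustration only.
SAMPLE_LOG = [
    "1286536309.450    919 192.168.0.1 TCP_MISS/200 2048 GET http://example.com/a - DIRECT/93.184.216.34 text/html",
    "1286536310.120     12 192.168.0.1 TCP_HIT/200 2048 GET http://example.com/a - NONE/- text/html",
    "1286536311.900    301 192.168.0.1 TCP_MISS/200 512 POST http://example.com/form - DIRECT/93.184.216.34 text/html",
    "1286536312.001     45 192.168.0.1 TCP_MISS/200 4096 GET http://example.org/b - DIRECT/93.184.216.34 text/html",
]

if __name__ == "__main__":
    for url in extract_urls(SAMPLE_LOG):
        print(url)
```

The resulting list could then be handed to whatever fetcher or indexer is in use; filtering out noisy sites would be a separate step.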

ck2 4504 days ago

That's personal use, I have no problem with that. The above project sounds commercial in nature.

ryguytilidie 4504 days ago

That seems pretty presumptuous...

diminoten 4504 days ago

Why should he? It's publicly available information.
