by nathancahill 4504 days ago
I've done this before for large scraping projects. I find the datacenter the target website is hosted in, then get a dedicated server right next to it. I've never gotten better performance.
3 comments

Putting aside legal issues, you don't have any moral problems scraping large volumes of content that is not yours?
Certainly not! If I were reselling the data, maybe. But I'm generally using it for statistics and data viz. I include the source of the data and I always obey robots.txt. Sometimes I'm even able to talk with the owners beforehand to get their OK.
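
For anyone wondering what "obey robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `allowed_paths`, the `stats-bot` user agent, and the example rules are all made up for illustration; this is not the commenter's actual tooling.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_lines, user_agent, paths):
    """Return only the paths that the given robots.txt lines permit."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_lines)
    return [p for p in paths if parser.can_fetch(user_agent, p)]


# Hypothetical rules, standing in for a site's real robots.txt.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()

if __name__ == "__main__":
    candidates = ["/data/1", "/private/x"]
    print(allowed_paths(EXAMPLE_ROBOTS, "stats-bot", candidates))
```

Against a live site you would load the rules with `set_url()` and `read()` on the same parser instead of passing lines in by hand.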

(Don't downvote him, it's a valid question)

Now you'll have to tell us about the project...

300TB is quite a lot, even today.

Over time, I've learned to wget every web page and content archive I want to keep. The Internet forgets.
In an earlier age, I ran everything through squid to consolidate browser caches. About five minutes after setting it up, I realised that pulling all the references in the log file and then indexing the lot with htdig would be tremendously useful when I was on the road without internet access.
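
A rough sketch of the "pull all the references in the log file" step, assuming squid's default native access.log layout (timestamp, elapsed, client, result code, bytes, method, URL, ...). The function name is hypothetical, and handing the resulting list to an indexer such as htdig is left out because that part depends on local configuration.

```python
def urls_from_squid_log(lines):
    """Collect unique GET URLs from squid native-format log lines, in order."""
    seen = set()
    ordered = []
    for line in lines:
        fields = line.split()
        # Native format puts the method in field 6 and the URL in field 7.
        if len(fields) < 7:
            continue  # skip truncated or malformed lines
        method, url = fields[5], fields[6]
        # CONNECT lines carry host:port rather than a page URL, so skip them.
        if method != "GET" or url in seen:
            continue
        seen.add(url)
        ordered.append(url)
    return ordered


if __name__ == "__main__":
    # Path is an assumption; squid installations log to different places.
    with open("/var/log/squid/access.log", encoding="utf-8", errors="replace") as fh:
        for url in urls_from_squid_log(fh):
            print(url)
```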

I spent way too much time pruning stupid crap such as slashdot and started to learn this 'Bayesian classifier' thing.

Your idea is much better.

That's personal use, I have no problem with that. The above project sounds commercial in nature.
That seems pretty presumptuous...
Why should he? It's publicly available information.
How large of a scraping project are we talking about in terms of throughput?
Largest was over 300TB. I talked with the owner beforehand and got access to the internal IP address so traffic wouldn't leave the datacenter (free of cost).

I offered to help them set up an API instead of scraping, but they decided scraping was easier in the short term.

dumb question, but how were you finding their datacenter?
traceroute is a good starting point. Sometimes I have to try a couple of different datacenters until I hit <4ms ping time. Sometimes I just ask the datacenter if website x is hosted there.
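
A small sketch of the "try until I hit <4ms" check. It times TCP connects rather than ICMP pings (a swapped-in technique, since raw ICMP sockets usually need root), so the numbers will run slightly higher than real ping times. The function names and the default port are assumptions; only the 4 ms threshold comes from the comment above.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port=443, timeout=2.0):
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed on leaving the with-block
    return (time.perf_counter() - start) * 1000.0


def looks_colocated(samples_ms, threshold_ms=4.0):
    """True when the median round trip is under the threshold."""
    return statistics.median(samples_ms) < threshold_ms


if __name__ == "__main__":
    target = "example.com"  # placeholder for the site being located
    samples = [tcp_connect_ms(target) for _ in range(5)]
    print(f"median {statistics.median(samples):.2f} ms,",
          "colocated" if looks_colocated(samples) else "keep looking")
```

Using the median keeps a single slow handshake from ruling out a server that is otherwise close.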
The easiest way is usually to run a 'whois' on the IP of the server (not the domain).
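
To make the "IP, not the domain" point concrete, here is a hedged sketch that resolves the hostname first and then shells out to the system `whois` command. It assumes a `whois` binary is on the PATH. The field names it looks for (`OrgName`, `org-name`, `netname`, `descr`) are common in registry output but vary between registries, so treat that list as a guess, not a spec.

```python
import socket
import subprocess

# Assumed owner-ish field names; different registries label these differently.
OWNER_KEYS = ("orgname", "org-name", "netname", "descr")


def owner_fields(whois_text):
    """Pick out (key, value) pairs that usually name the network owner."""
    found = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in OWNER_KEYS and value.strip():
            found.append((key.strip(), value.strip()))
    return found


def whois_for_host(hostname):
    """Resolve hostname to an IPv4 address, then run whois on that address."""
    ip = socket.gethostbyname(hostname)
    result = subprocess.run(
        ["whois", ip], capture_output=True, text=True, check=True
    )
    return ip, owner_fields(result.stdout)


if __name__ == "__main__":
    ip, fields = whois_for_host("example.com")  # placeholder hostname
    print(ip)
    for key, value in fields:
        print(f"{key}: {value}")
```

A whois on the domain only reports the registrant, while a whois on the address reports who operates the network block, which is what points at the hosting provider.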