	by nathancahill 4504 days ago
	I've done this before for large scraping projects. I find the datacenter the target website is hosted in, then get a dedicated server right next to it. I've never gotten better performance.

3 comments

ck2 4504 days ago

Putting aside legal issues, you don't have any moral problems with scraping large volumes of content that is not yours?

nathancahill 4504 days ago

Certainly not! If I were re-selling the data, maybe. But I'm generally using it for statistics and data viz. I include the source of the data and I always obey robots.txt. Sometimes I'm even able to talk with the owners beforehand to get their OK.

(Don't downvote him, it's a valid question)
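
For readers wondering what "obey robots.txt" can look like in practice, here is a minimal sketch using Python's standard urllib.robotparser. It is not the commenter's actual tooling. The robots.txt content, the example.com URLs, and the user agent name "example-stats-bot" are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-stats-bot"  # hypothetical user agent name
checker = build_checker(SAMPLE)

# Check each URL before fetching it, and honor the crawl delay if one is set.
print(checker.can_fetch(AGENT, "http://example.com/public/page.html"))
print(checker.can_fetch(AGENT, "http://example.com/private/data.html"))
print(checker.crawl_delay(AGENT))
```

In a real scraper the text would come from the site's own /robots.txt, and the crawl delay (when present) would set the minimum pause between requests.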

nl 4503 days ago

Now you'll have to tell us about the project...

300TB is quite a lot, even today.

CamperBob2 4504 days ago

Over time, I've learned to wget every web page and content archive I want to keep. The Internet forgets.
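
A rough Python equivalent of the "save a copy of everything" habit, as a sketch only. It stores the raw bytes of a single URL under a dated filename and does not fetch page requisites the way wget can. The directory layout, the .html suffix, and the function names are assumptions made for this example.

```python
import hashlib
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def snapshot_path(url: str, root: Path, when: float) -> Path:
    """Map a URL to root/<host>/<YYYYMMDD>-<short hash of the URL>.html."""
    host = urlsplit(url).hostname or "unknown-host"
    stamp = time.strftime("%Y%m%d", time.gmtime(when))
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return root / host / f"{stamp}-{digest}.html"


def archive(url: str, root: Path, fetch=None) -> Path:
    """Download url and write the raw bytes under root; return the file path."""
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=30) as resp:
                return resp.read()
    path = snapshot_path(url, root, time.time())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch(url))
    return path
```

The optional fetch argument exists only so the download step can be swapped out, for example when trying the function without network access.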

reeses 4503 days ago

In an earlier age, I ran everything through squid to consolidate browser caches. About five minutes after setting it up, I realised that pulling all the references in the log file and then indexing the lot with htdig would be tremendously useful when I was on the road without internet access.

I spent way too much time pruning stupid crap such as slashdot and started to learn this 'Bayesian classifier' thing.

Your idea is much better.
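
The "pull all the references out of the log file" step might look something like the sketch below. It assumes squid's default native access.log layout, where the request method is the sixth whitespace-separated field and the URL is the seventh; a custom logformat would need different indices. The sample log lines are invented for illustration.

```python
def urls_from_squid_log(lines):
    """Return the unique GET URLs from squid native-format log lines, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        method, url = fields[5], fields[6]
        if method != "GET":
            continue  # CONNECT entries only carry host:port, not a page URL
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Invented sample lines in the assumed native format.
SAMPLE_LOG = [
    "1066036250.603 345 192.168.0.10 TCP_MISS/200 10182 GET http://example.com/index.html - DIRECT/192.0.2.10 text/html",
    "1066036251.120 12 192.168.0.10 TCP_HIT/200 2048 GET http://example.com/index.html - NONE/- text/html",
    "1066036252.480 90 192.168.0.11 TCP_MISS/200 512 CONNECT example.org:443 - DIRECT/192.0.2.20 -",
    "1066036253.002 60 192.168.0.11 TCP_MISS/200 4096 GET http://example.net/docs/a.html - DIRECT/192.0.2.30 text/html",
]

for url in urls_from_squid_log(SAMPLE_LOG):
    print(url)
```

The resulting list is what would then be handed to an indexer such as htdig.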

ck2 4504 days ago

That's personal use, I have no problem with that. The above project sounds commercial in nature.

ryguytilidie 4504 days ago

That seems pretty presumptuous...

diminoten 4504 days ago

Why should he? It's publicly available information.

weaksauce 4504 days ago

How large of a scraping project are we talking about in terms of throughput?

nathancahill 4504 days ago

Largest was over 300TB. I talked with the owner beforehand and got access to the internal IP address so traffic wouldn't leave the datacenter (free of cost).

I offered to help them set up an API instead of scraping, but they decided scraping was easier in the short term.

brown9-2 4504 days ago

dumb question, but how were you finding their datacenter?

nathancahill 4504 days ago

traceroute is a good starting point. Sometimes I have to try a couple different datacenters until I hit <4ms ping time. Sometimes I just ask the datacenter if website x is hosted there.
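
One way to script the "try datacenters until the latency is tiny" check from a candidate server is sketched below. It times a TCP handshake instead of sending ICMP pings (raw ICMP sockets usually need elevated privileges), so the numbers are only a proxy for ping time. The 4 ms threshold comes from the comment above; the function names and default values are assumptions.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int = 80, samples: int = 5, timeout: float = 2.0) -> float:
    """Return the fastest observed TCP connect time to host:port, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start  # handshake finished at this point
        timings.append(elapsed * 1000.0)
    return min(timings)


def looks_colocated(latency_ms: float, threshold_ms: float = 4.0) -> bool:
    """Heuristic only: a very low round trip suggests the same datacenter or one nearby."""
    return latency_ms < threshold_ms


# Usage from a candidate server (hostname is hypothetical):
#   ms = tcp_connect_ms("www.example.com", 80)
#   print(ms, looks_colocated(ms))
```

Taking the minimum of several samples filters out one-off delays; a low value still does not prove the two machines share a building.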

nacs 4504 days ago

The easiest way is usually to run a 'whois' on the IP of the server (not the domain).
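
A sketch of that lookup in Python, assuming nothing beyond the standard library: resolve the domain to an IP, then send the IP to a WHOIS server over TCP port 43 (the plain-text protocol described in RFC 3912). Which WHOIS server to ask depends on the regional registry that owns the address range, so the server is left as a parameter here. The function names are made up for this example.

```python
import socket


def whois_query(query: str, server: str, port: int = 43, timeout: float = 10.0) -> str:
    """Send one WHOIS query and return the server's full text response."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def whois_for_host(hostname: str, server: str) -> str:
    """Resolve hostname to an IPv4 address and look up that address, not the domain."""
    ip = socket.gethostbyname(hostname)
    return whois_query(ip, server)


# Usage (hostname is hypothetical; whois.arin.net covers North American ranges):
#   print(whois_for_host("www.example.com", "whois.arin.net"))
```

The network name and organization fields in the response usually point to the hosting provider, which narrows down the datacenter.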
