I've done this before for large scraping projects. I find the datacenter the target website is hosted in, then get a dedicated server right next to it. I've never gotten better performance.
Certainly not! If I were reselling the data, maybe. But I'm generally using it for statistics and data viz. I include the source of the data and I always obey robots.txt. Sometimes I'm even able to talk with the owners beforehand to get their OK.
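For anyone wondering what obeying robots.txt looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent name `stats-bot`, the example.com URLs, and the sample rules are all made up for illustration. A real scraper would fetch the site's actual robots.txt rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(SAMPLE_ROBOTS)
public_ok = allowed(parser, "stats-bot", "https://example.com/data/page.html")
private_ok = allowed(parser, "stats-bot", "https://example.com/private/report")
delay_seconds = parser.crawl_delay("stats-bot")  # None if no Crawl-delay is set
```

Checking `crawl_delay` as well as `can_fetch` matters for big jobs, since the delay is usually the part that keeps the site owner happy.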
In an earlier age, I ran everything through squid to consolidate browser caches. About five minutes after setting it up, I realised that pulling all the references in the log file and then indexing the lot with htdig would be tremendously useful when I was on the road without internet access.
I spent way too much time pruning stupid crap such as slashdot and started to learn this 'Bayesian classifier' thing.
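The "pull all the references out of the log" step could look roughly like the sketch below. It assumes squid's default native access.log layout, where the request method is the sixth whitespace-separated field and the URL is the seventh; a custom `logformat` would need different indexes. The sample lines are invented, and feeding the resulting list to htdig is not shown.

```python
def extract_urls(log_lines):
    """Return unique GET URLs from squid access.log lines, in first-seen order.

    Assumes the native format:
    time elapsed client action/code size method URL ident hierarchy/from type
    """
    seen = set()
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        method, url = fields[5], fields[6]
        if method != "GET":
            continue  # CONNECT, POST, etc. are not worth indexing
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    "1066036250.128 120 10.0.0.5 TCP_MISS/200 4521 GET http://example.com/a.html - DIRECT/192.0.2.1 text/html",
    "1066036251.300 15 10.0.0.5 TCP_HIT/200 4521 GET http://example.com/a.html - NONE/- text/html",
    "1066036252.900 210 10.0.0.7 TCP_MISS/200 812 POST http://example.com/form - DIRECT/192.0.2.1 text/html",
    "1066036253.010 95 10.0.0.7 TCP_MISS/200 9034 GET http://example.org/b.html - DIRECT/192.0.2.2 text/html",
]

unique_urls = extract_urls(SAMPLE_LOG)
```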
Largest was over 300TB. I talked with the owner beforehand and got access to the internal IP address so traffic wouldn't leave the datacenter (free of cost).
I offered to help them set up an API instead of scraping, but they decided scraping was easier in the short term.
traceroute is a good starting point. Sometimes I have to try a couple of different datacenters until I hit <4ms ping time. Sometimes I just ask the datacenter if website x is hosted there.
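If ICMP ping is blocked, timing a TCP connect from each candidate server gives a rough stand-in for round-trip time. This is only a sketch: the function names are my own, port 443 is an assumed default, and the number includes DNS lookup time when you pass a hostname, so an IP address gives a cleaner reading.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    # create_connection returns once the handshake completes; close right away.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median of several connect timings, to smooth out one-off spikes."""
    return statistics.median(tcp_connect_ms(host, port) for _ in range(samples))


def close_enough(host: str, port: int = 443, threshold_ms: float = 4.0) -> bool:
    """True if the median connect time is under the threshold (4 ms here)."""
    return median_connect_ms(host, port) < threshold_ms
```

Run `close_enough("203.0.113.10")` (a placeholder address) from a trial server in each datacenter and keep the one that passes.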