by vesterr
6627 days ago
Ha, with EC2 you can get around that now. Separate your website IP from your crawling IP, and every time CL blocks an IP, switch to a new one. Eventually they'll have to block the entire AWS range, but that's okay: you can crawl over cable modem/DSL connections that use DHCP. What are they going to do, block Verizon, Comcast and Time Warner? Then you can get around referer [sic] blocks for your links to CL that the user clicks on by using a redirect, I think. You can slurp the entirety of CL daily without causing them traffic problems. I mean, it's equivalent to each page getting one page view per day, which is nothing. Just keep track of URLs so you only slurp new content, and serve thumbnails off your own hosts (it's fair use).
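
For the "keep track of URLs" part, here is a rough sketch of what I mean, not anything tested against a real site. The `SeenUrls` class, the `seen.json` file name, and the example URLs are all made up for illustration; it just remembers which URLs earlier runs already saw so that each page is requested at most once.

```python
import json
from pathlib import Path


class SeenUrls:
    """Remembers which URLs have already been fetched, persisted as a JSON list."""

    def __init__(self, path):
        self.path = Path(path)
        # Load URLs recorded by earlier runs, if the file exists.
        if self.path.exists():
            self.seen = set(json.loads(self.path.read_text()))
        else:
            self.seen = set()

    def filter_new(self, urls):
        """Return only the URLs not seen before (in order) and record them."""
        fresh = []
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                fresh.append(url)
        return fresh

    def save(self):
        # Write the full set back so the next run can skip these URLs.
        self.path.write_text(json.dumps(sorted(self.seen)))


if __name__ == "__main__":
    tracker = SeenUrls("seen.json")  # hypothetical file name
    listing = ["https://example.org/post/1", "https://example.org/post/2"]
    for url in tracker.filter_new(listing):
        print("would fetch", url)  # the actual fetch is left out on purpose
    tracker.save()
```

Run it twice with the same listing and the second run fetches nothing, which is the whole point: one request per page, ever.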