Hacker News
by jalfresi 4201 days ago
The packages don't state whether the number of pages is per month, day or hour. We currently scrape well over 5 million pages an hour for a lot less (although, much like you, we are geared up for such loads), but it would be interesting to see the cost per number of pages per hour you charge for odd jobs/one-offs.
4 comments

The packages are for a set number of pages, with no timeframe. I don't have prices yet for a load in the millions of pages per hour. What kind of system do you feed data into?
A very large array of MySQL databases. Basically, each month we fire up a fresh database and start streaming data into it. We're currently pulling around 700Gb a month. Our reporting tools/systems run queries across this array. It's actually not that bad speed-wise (reports of over 9000 keywords over a 1 week period for top 100 positions on a per hour basis).
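To make the one-database-per-month layout concrete, here is a minimal sketch of how a writer might pick the database for a given day and how a report might work out which databases a date range touches. The naming scheme (`scrape_YYYY_MM`) and both function names are assumptions for illustration, not details from the system described above.

```python
from datetime import date


def db_name_for(day: date, prefix: str = "scrape") -> str:
    """Return the (hypothetical) name of the monthly database holding `day`."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def dbs_for_range(start: date, end: date, prefix: str = "scrape") -> list[str]:
    """List every monthly database a report covering start..end must query."""
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

Under this scheme a one-week report usually hits a single database and at most two when the week straddles a month boundary, which may be part of why such queries stay reasonably fast.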
If you are able to crawl that volume as cheaply as you say, you should definitely be offering it as a service.
It's a very single-purpose system, so probably not amenable to general-purpose crawling (though we do have a separate system, based on the design of our core system, that is a general-purpose web crawler/indexer).
Are you rendering the JavaScript on those pages? What's your in-domain delay?
Yes, and injecting JS into the page for easy analysis/collection. The delay is dynamic, based on captcha rates, proxy load and historic captcha costs. It constantly checks the current running costs and throttles the number of requests over the hour.
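A throttle of this kind could look roughly like the sketch below: it derives the next per-request delay from the observed captcha rate, the proxy load and the budget still unspent for the hour. Every name, default value and the formula itself are assumptions made for illustration; the actual system is not described in enough detail to reproduce.

```python
from dataclasses import dataclass


@dataclass
class HourStats:
    """Hypothetical running totals for the current hour."""
    requests_sent: int
    captchas_seen: int
    spend_so_far: float   # captcha-solving cost so far this hour
    proxy_load: float     # 0.0 (idle) to 1.0 (saturated)


def next_delay(stats: HourStats, base_delay: float = 1.0,
               hourly_budget: float = 10.0, cost_per_captcha: float = 0.002,
               seconds_left: float = 3600.0) -> float:
    """Return seconds to wait before the next request (illustrative formula)."""
    captcha_rate = stats.captchas_seen / max(stats.requests_sent, 1)
    budget_left = max(hourly_budget - stats.spend_so_far, 0.0)
    # Expected cost of one more request, using the historic per-captcha cost.
    cost_per_request = captcha_rate * cost_per_captcha
    if cost_per_request > 0 and budget_left == 0:
        return seconds_left  # budget exhausted: wait out the rest of the hour
    if cost_per_request > 0:
        affordable = budget_left / cost_per_request
        budget_delay = seconds_left / affordable
    else:
        budget_delay = 0.0  # no captchas seen yet, so cost imposes no limit
    # Back off further as the proxy pool gets busier.
    return max(base_delay, budget_delay) * (1.0 + stats.proxy_load)
```

Spreading the remaining budget over the remaining seconds is one simple way to "throttle the number of requests over the hour"; a production system would likely smooth the captcha rate over a sliding window rather than use raw hourly totals.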
Is there any place I can read more about the technical side of this? I would love to know how you achieve these rates.