by dilatedmind
1641 days ago
I worked on a project which required some medium-scale web scraping (fewer than 100 million pages), and went with Node primarily because of Puppeteer. The system had a couple dozen worker processes doing the scraping, and one coordinator which maintained a queue of pages that needed to be scraped. There was some logic to balance requests between sites, so we weren't making more than one request per second to any particular site. The coordinator just had a REST API endpoint, which the workers would hit to get their next job and to return whatever data they had scraped. Each worker process ran on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled. These are only a few dollars a day, and it was necessary to have as many IP addresses as possible (at least 5% of the sites we were scraping had some preventative measures in place, but they all seemed to be IP-based).
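For anyone curious what the per-site balancing might look like, here is a minimal sketch of a coordinator-side queue. It is not the original code: the class name `PoliteQueue`, the method names, and the 1000 ms default interval are all assumptions made for illustration. The idea is to keep one queue per hostname, hand out a URL only when that host is off cooldown, and rotate between hosts so one large site cannot starve the rest.

```typescript
// Hypothetical sketch of the coordinator's queue. All names and the
// 1000 ms interval are assumptions, not the original system's code.
class PoliteQueue {
  // Pending URLs grouped by hostname. Empty queues are never stored.
  private pending = new Map<string, string[]>();
  // Earliest timestamp (ms) at which each host may be requested again.
  private nextAllowed = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs: number = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Add a page to the queue of its host.
  enqueue(url: string): void {
    const host = new URL(url).hostname;
    const queue = this.pending.get(host) ?? [];
    queue.push(url);
    this.pending.set(host, queue);
  }

  // Return the next URL whose host is off cooldown, or null if every
  // host with pending work was requested too recently.
  next(nowMs: number = Date.now()): string | null {
    for (const [host, queue] of this.pending) {
      if ((this.nextAllowed.get(host) ?? 0) > nowMs) continue;
      const url = queue.shift() as string;
      // Map preserves insertion order, so delete + re-insert moves this
      // host to the back and gives other hosts a turn first.
      this.pending.delete(host);
      if (queue.length > 0) this.pending.set(host, queue);
      this.nextAllowed.set(host, nowMs + this.minIntervalMs);
      return url;
    }
    return null;
  }
}
```

In a setup like the one described, the REST endpoint the workers poll would call something like `next()` and tell the worker to retry shortly when it gets `null`. Passing the clock in as `nowMs` is just a choice that makes the behavior easy to test.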
I wonder if this kind of process is cheaper on Lambda.