by nerdponx
2820 days ago
Web scraping "at scale" ends up being a lot more complicated than blindly firing HTTP requests. Scrapy, for example, supports arbitrary "middleware" that can follow 301 redirects, respect robots.txt files, follow sitemap.xml files, etc. To what extent is this supported (or to what extent do you plan to support it)? Similarly, since the front-end language is essentially a compiler, would it be possible to write an alternative "backend" (e.g. something that distributes requests across a cluster)?
|
Out of the box, there is no scaling mechanism yet, since the project is a WIP. But it's written in Go, which makes it pretty fast.
One way to scale it would be to run a cluster of headless Chrome instances, put a proxy/load balancer in front of them, and give Ferret the URL of the cluster. Ferret will treat it as a single instance of Chrome. The only problem is that you would need to distinguish requests coming from the CDP (Chrome DevTools Protocol) client and, once a page is open, route all related requests to the same Chrome instance.
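
To make the "same Chrome instance" part concrete, here is a minimal sketch of just the routing decision such a proxy would have to make. It is not part of Ferret. The StickyRouter type, the Bind/Route names, and the backend addresses are made up for illustration, and it assumes page-level CDP connections use paths of the form /devtools/page/<targetId>. A real proxy would also have to forward the HTTP and WebSocket traffic and learn target IDs from the backends' responses.

```go
package main

import (
	"strings"
	"sync"
)

// StickyRouter is a hypothetical helper: it remembers which Chrome backend
// owns each CDP target (page) so later requests go to the same instance.
type StickyRouter struct {
	mu       sync.Mutex
	backends []string          // assumed backend addresses, e.g. "chrome-1:9222"
	next     int               // round-robin cursor for requests with no known owner
	owner    map[string]string // target ID -> backend address
}

func NewStickyRouter(backends []string) *StickyRouter {
	return &StickyRouter{backends: backends, owner: make(map[string]string)}
}

// targetID extracts <id> from a path like /devtools/page/<id>.
// The path layout is an assumption about the CDP endpoints being proxied.
func targetID(path string) (string, bool) {
	const prefix = "/devtools/page/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	id := strings.TrimPrefix(path, prefix)
	return id, id != ""
}

// Bind records that a target lives on a given backend. A proxy would call
// this after it sees a backend report a newly opened page.
func (r *StickyRouter) Bind(id, backend string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.owner[id] = backend
}

// Route returns the backend for a request path: the owning backend if the
// path names a known target, otherwise the next backend in round-robin
// order. It returns "" when no backends are configured.
func (r *StickyRouter) Route(path string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := targetID(path); ok {
		if b, found := r.owner[id]; found {
			return b
		}
	}
	if len(r.backends) == 0 {
		return ""
	}
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}
```

The idea is that a request to open a new page is sent to whatever Route returns, the proxy then calls Bind with the target ID that backend reports, and every later request for that page resolves to the same backend. Requests that don't name a known page are simply spread round-robin.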