by ziflex 2818 days ago
This package is more of a runtime. There are plans to create a dedicated server where you would be able to store your queries, schedule them, and set up output streams such as Spark or Flink. For now, it does not respect robots.txt, but that can easily be added.

Out of the box, there is no scaling mechanism yet, since the project is a WIP. But it's written in Go, which makes it pretty fast.

One way you could scale it is to run a cluster of headless Chrome instances, put a proxy/load balancer in front of them, and give Ferret the URL of the cluster. Ferret will treat it as a single instance of Chrome. The only problem is that you would need to distinguish requests coming from the CDP (Chrome DevTools Protocol) client and, once a page is open, route all related requests to the same Chrome instance.
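
To make the "same Chrome instance" part concrete, here is a rough sketch of the routing decision such a proxy would have to make. It is not part of Ferret: `Router`, `Bind`, `Route`, and the backend addresses are hypothetical names made up for illustration. It assumes pages are addressed by paths of the form `/devtools/page/<id>`, as in Chrome's DevTools HTTP/WebSocket endpoints, and that the proxy learns the page ID from the response of the instance that opened the page. Everything else is plain round-robin.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Router picks a Chrome backend for each request and keeps every request
// for an already-opened page on the backend that owns that page.
// Hypothetical helper for illustration; not a Ferret or Chrome API.
type Router struct {
	mu       sync.Mutex
	backends []string          // assumed addresses, e.g. "chrome-1:9222"
	next     int               // round-robin cursor
	sticky   map[string]string // page (target) ID -> backend
}

func NewRouter(backends []string) *Router {
	return &Router{backends: backends, sticky: make(map[string]string)}
}

// targetID extracts the ID from a path such as /devtools/page/<id>.
func targetID(path string) (string, bool) {
	const prefix = "/devtools/page/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	id := strings.TrimPrefix(path, prefix)
	if id == "" || strings.Contains(id, "/") {
		return "", false
	}
	return id, true
}

// Bind records that a page lives on the given backend. A real proxy would
// call this after reading the page ID from the backend's response.
func (r *Router) Bind(id, backend string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.sticky[id] = backend
}

// Route returns the backend for a request path: the owning backend for a
// known page, otherwise the next backend in round-robin order.
// It returns "" when no backends are configured.
func (r *Router) Route(path string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := targetID(path); ok {
		if b, found := r.sticky[id]; found {
			return b
		}
	}
	if len(r.backends) == 0 {
		return ""
	}
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}

func main() {
	r := NewRouter([]string{"chrome-1:9222", "chrome-2:9222"})
	b := r.Route("/json/new") // opening a page goes to the next backend
	r.Bind("PAGE1", b)        // remember which instance owns the page
	fmt.Println(b, r.Route("/devtools/page/PAGE1"))
}
```

A real proxy would still need to forward the HTTP and WebSocket traffic, and to drop a binding when the page is closed; this only covers the lookup.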