Hacker News new | ask | show | jobs
by binux 4236 days ago
the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes for centralized queue which is in scheduler. It's designed to satisfy about 10-100 million urls for each project.

scheduler, fetchers, processors are connected with rabbitmq(alternatively). Only one scheduler is allowed. But you can run multiple fetchers or processors as needed.

1 comments

Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages, that is done later.
The fetcher fit you already...
You are running

   phantomjs phantomjs_fetcher.js
and using it as proxy? The setup instructions are a bit unclear on this.
I want to make it a http proxy in the beginning. But I found it hard to do so. Then I post every to it, but haven't change the name.

But it works like a proxy, that any request with `fetch_type == 'js'` would be fetched through phantomjs and the response back to tornado_fetcher.