

by binux 4236 days ago

the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, there is a centralized queue, which lives in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected through RabbitMQ (optionally). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
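To make the layout above concrete, here is a minimal sketch of the same shape: one scheduler feeding several fetchers, which feed several processors. This is not pyspider's code. It assumes an in-process `queue.Queue` as a stand-in for RabbitMQ, threads as stand-ins for separate processes, and a fake fetch step; `run_pipeline` and every name inside it are hypothetical.

```python
import queue
import threading


def run_pipeline(urls, num_fetchers=3, num_processors=2):
    """Toy model: one scheduler, many fetchers and processors, joined by queues."""
    to_fetch = queue.Queue()    # scheduler -> fetchers
    to_process = queue.Queue()  # fetchers -> processors
    results = []
    results_lock = threading.Lock()

    def fetcher():
        while True:
            url = to_fetch.get()
            if url is None:  # sentinel: no more work for this fetcher
                break
            # Stand-in for a real HTTP fetch (assumption: no network access here).
            to_process.put((url, "<html>%s</html>" % url))

    def processor():
        while True:
            item = to_process.get()
            if item is None:  # sentinel: no more work for this processor
                break
            url, body = item
            with results_lock:
                results.append((url, len(body)))

    fetchers = [threading.Thread(target=fetcher) for _ in range(num_fetchers)]
    processors = [threading.Thread(target=processor) for _ in range(num_processors)]
    for worker in fetchers + processors:
        worker.start()

    # The single scheduler: the only place that decides what gets fetched.
    for url in urls:
        to_fetch.put(url)

    # Shut down stage by stage so no item is dropped.
    for _ in fetchers:
        to_fetch.put(None)
    for worker in fetchers:
        worker.join()
    for _ in processors:
        to_process.put(None)
    for worker in processors:
        worker.join()

    return sorted(results)
```

Calling `run_pipeline(["http://a.example", "http://b.example"])` returns one `(url, body_length)` pair per URL. Because all scheduling decisions sit in one place, the fetcher and processor counts can be changed freely without any coordination between workers.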

1 comments

maratc 4236 days ago

Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages, that is done later.


binux 4236 days ago

The fetcher alone already fits your use case...


maratc 4236 days ago

You are running

   phantomjs phantomjs_fetcher.js

and using it as a proxy? The setup instructions are a bit unclear on this.


binux 4236 days ago

I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. So now I POST everything to it, but I haven't changed the name.

But it works like a proxy: any request with `fetch_type == 'js'` is fetched through phantomjs, and the response is sent back to tornado_fetcher.
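Roughly, the routing described above could look like the sketch below. It is only an illustration, not the actual tornado_fetcher code: the flat task dict, the `fetch` function, and the two backend callables `http_fetch` and `js_fetch` are all hypothetical names.

```python
def fetch(task, http_fetch, js_fetch):
    """Route a task to the JS-rendering backend when fetch_type == 'js'.

    `task` is assumed to be a plain dict; `http_fetch` and `js_fetch` are
    hypothetical callables that take the task and return a response.
    """
    if task.get("fetch_type") == "js":
        # Assumption: js_fetch forwards the task to the phantomjs process
        # and hands its response back to the caller.
        return js_fetch(task)
    # Everything else goes through the ordinary HTTP path.
    return http_fetch(task)
```

The caller only ever talks to `fetch`, so the phantomjs side stays hidden behind it, which is why it behaves like a proxy even though it is not a real HTTP proxy.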
