by atombender 3936 days ago
One tip: If you make your pipeline fine-grained, you get much more flexibility in scheduling and parallelization, and the pipeline becomes easier to design and extend.

By fine-grained I mean that fetching, crawling, extraction and whatever other processing you're doing should be separate, discrete steps.

Example naive topology:

Fetcher: Pops next URL off a queue, fetches it, stores the raw data somewhere, emits a "fetched" event.

Link extractor: Subscribes to "fetched" events, extracts every URL from the data, and emits each one as a "link" event.

Crawling scheduler: Listens to link events, schedules "fetch" events for each URL. This is where you might add filtering and prioritization rules, for example.
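To make the topology concrete, here is a rough single-process sketch in Python. Everything in it is an assumption for illustration: the queues are plain in-memory deques rather than a real broker, FAKE_WEB is a made-up dictionary standing in for HTTP, the regex is a deliberately crude link extractor, and the only filtering rule is "skip URLs already seen".

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical stand-in for the network: URL -> raw HTML.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.com/a": '<a href="/b">b</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

fetch_queue = deque()    # URLs waiting to be fetched
fetched_queue = deque()  # "fetched" events (URL of a stored page)
link_queue = deque()     # "link" events (one discovered URL each)
raw_store = {}           # raw page data, keyed by URL
seen = set()             # URLs that have already been scheduled

HREF_RE = re.compile(r'href="([^"]+)"')  # crude; a real extractor would parse HTML


def fetcher_step():
    """Pop the next URL, store its raw data, emit a "fetched" event."""
    url = fetch_queue.popleft()
    raw_store[url] = FAKE_WEB.get(url, "")
    fetched_queue.append(url)


def link_extractor_step():
    """Consume a "fetched" event and emit one "link" event per URL found."""
    url = fetched_queue.popleft()
    for href in HREF_RE.findall(raw_store[url]):
        link_queue.append(urljoin(url, href))


def scheduler_step():
    """Consume a "link" event and schedule a fetch unless it is filtered out."""
    url = link_queue.popleft()
    if url not in seen:  # filtering and prioritization rules would go here
        seen.add(url)
        fetch_queue.append(url)


def run(seed):
    """Drive the three steps until every queue is empty."""
    seen.add(seed)
    fetch_queue.append(seed)
    while fetch_queue or fetched_queue or link_queue:
        if fetch_queue:
            fetcher_step()
        if fetched_queue:
            link_extractor_step()
        if link_queue:
            scheduler_step()
    return sorted(raw_store)
```

Each step only touches its own input queue and output queue, so any of them could be moved into its own process later without changing the other two.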

Now you have three queues and three consumers, which can run in parallel with any number of worker processes dedicated to them. A naive solution could use something like a database for the events, but a dedicated queue such as RabbitMQ would fare better.
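As a sketch of what "any number of workers per queue" can look like, here is a small Python example using threads and the standard library's queue.Queue. It is not RabbitMQ code; the in-process queues only stand in for a broker, and start_workers, run_demo, and the placeholder fetch and extract handlers are hypothetical names invented for this illustration.

```python
import queue
import threading


def start_workers(count, inbox, handler):
    """Start `count` daemon threads that each take items from `inbox` forever."""
    def loop():
        while True:
            item = inbox.get()
            try:
                handler(item)
            finally:
                inbox.task_done()  # lets inbox.join() know this item is finished

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()
    return threads


def run_demo(urls, fetch_workers=4, extract_workers=2):
    fetch_q = queue.Queue()    # URLs to fetch
    fetched_q = queue.Queue()  # "fetched" events
    results = []
    lock = threading.Lock()    # protects `results` across extractor threads

    def fetch(url):
        # Placeholder for a real HTTP request.
        fetched_q.put((url, f"<body of {url}>"))

    def extract(event):
        url, body = event
        with lock:
            results.append((url, len(body)))

    # Each stage gets its own pool; the pool sizes are independent of each other.
    start_workers(fetch_workers, fetch_q, fetch)
    start_workers(extract_workers, fetched_q, extract)

    for url in urls:
        fetch_q.put(url)

    fetch_q.join()    # every URL fetched, so every "fetched" event is enqueued
    fetched_q.join()  # every "fetched" event handled
    return sorted(results)
```

With a real broker the handlers would stay roughly the same; only the get/put plumbing would be replaced by the broker's consume and publish calls.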