Hacker News
by sourcecodeplz 816 days ago
They crawl all the time, and if an instance goes down it's no problem: there are still hundreds doing the same task. They consume waaaay too much traffic for the cloud to make sense financially.

A hybrid approach is best in cases like this: use the cloud for client-facing interfaces and rent dedicated servers for the spiders.

edit: even better, build your own data center instead of renting.

On a much smaller scale, I'm working on a web crawler as well, and renting a dedicated server at Hetzner with unlimited traffic is cheaper than any VPS or cloud offering.

8 cores, 32 GB RAM, 2x 500 GB SSD for ~€40/month — it's an older CPU but web crawlers don't spend too much time crunching numbers anyway.

What crawling framework are you using?
In-house, made in Elixir.

20% of a crawler is fetching and parsing pages; the remaining 80% is dealing with misconfigured, broken and non-standard web servers and HTML, plus Cloudflare, Akamai and random bot-busting tools that cause more false positives than a chaos monkey. It's better to write one yourself that you can control, monitor and operate as you need, instead of relying on third-party logic. Makes sense for my business, at least.
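
To make the "remaining 80%" a bit more concrete, here is a rough sketch of two of the defensive steps involved: decoding a body when the declared charset is wrong or unknown, and pulling links out of sloppy HTML. It is in Python using only the standard library, not the Elixir code mentioned above, and the names `decode_body`, `LinkCollector` and `extract_links` are made up for illustration. A real crawler would need far more than this (timeouts, size caps, robots.txt, rate limiting).

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


def decode_body(raw: bytes, declared_charset: str | None) -> str:
    """Decode a response body without trusting the server's charset header."""
    for name in (declared_charset, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that don't match it: try the next one.
            continue
    # Last resort: latin-1 maps every byte to a character, so this never raises.
    return raw.decode("latin-1")


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links; html.parser tolerates unclosed tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                # Resolve relative URLs and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value.strip()))
                # Skip mailto:, javascript: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `decode_body(b"caf\xe9", "utf-8")` falls through to latin-1 instead of raising, and `extract_links` still picks up an `href` that is unquoted or sits inside an unclosed `<a>`. Anything involving bot-detection services is a separate problem this sketch does not touch.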

Ah. Have so been there. But I don’t really have the resources to spin something up from scratch. Good luck!!