| HN Mirror


by sourcecodeplz 816 days ago

They crawl all the time. Their instances could go down with no problem, since there are still hundreds doing the same task. They consume waaaay too much traffic for the cloud to make sense financially.

A hybrid approach is best in cases like this. Use the cloud for client-facing interfaces and rent dedicated servers for the spiders.

edit: even better, build your own data center instead of renting.


sph 816 days ago

On a much smaller scale, I'm working on a web crawler as well, and renting a dedicated server at Hetzner with unlimited traffic is cheaper than any VPS or cloud offering.

8 cores, 32 GB RAM, 2x 500 GB SSD for ~€40/month. It's an older CPU, but web crawlers don't spend much time crunching numbers anyway.
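
To make the cost argument concrete, here is a rough sketch of the arithmetic in Python. Only the ~€40/month flat fee comes from the comment above; the per-GB rate is a made-up placeholder, and both function names are hypothetical. Plug in a real provider's price list before drawing conclusions.

```python
def monthly_egress_cost(traffic_tb: float, price_per_gb: float) -> float:
    """Metered traffic cost for one month, using 1 TB = 1000 GB."""
    return traffic_tb * 1000 * price_per_gb


def break_even_tb(flat_monthly_fee: float, price_per_gb: float) -> float:
    """Traffic (TB/month) above which a flat-fee server is the cheaper option."""
    return flat_monthly_fee / (price_per_gb * 1000)


# Assumed values, for illustration only.
FLAT_FEE = 40.0       # ~EUR 40/month, the figure quoted in the comment
ASSUMED_RATE = 0.05   # hypothetical metered price per GB, same currency

if __name__ == "__main__":
    for tb in (1, 10, 50):
        metered = monthly_egress_cost(tb, ASSUMED_RATE)
        print(f"{tb} TB/month: metered {metered:.2f} vs flat {FLAT_FEE:.2f}")
    print(f"break-even: {break_even_tb(FLAT_FEE, ASSUMED_RATE):.2f} TB/month")
```

This only compares traffic charges against the flat fee. Compute, storage and operations time are left out, so treat the break-even as a lower bound on how much traffic favors the dedicated box.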


bomewish 816 days ago

What crawling framework are you using?


sph 816 days ago

In-house, written in Elixir.

20% of a crawler is fetching and parsing pages. The remaining 80% is dealing with misconfigured, broken and non-standard web servers and HTML, plus Cloudflare, Akamai and random bot-busting tools that cause more false positives than a chaos monkey. It's better to write one yourself that you can control, monitor and operate as you need, instead of relying on third-party logic. Makes sense for my business, at least.
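
As an illustration of that "80%", here is a minimal Python sketch of two defensive steps: decoding a body without trusting the declared charset, and sorting responses into rough buckets. It is not the Elixir crawler described above, whose logic isn't shown in this thread. The function names, the bucket labels and the status-code heuristics are all assumptions chosen for the example.

```python
import re
from typing import Mapping, Optional

# Pull a charset out of a Content-Type header, if one is declared.
CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE)


def decode_body(raw: bytes, content_type: Optional[str]) -> str:
    """Decode a response body, tolerating wrong or unknown declared charsets."""
    candidates = []
    if content_type:
        match = CHARSET_RE.search(content_type)
        if match:
            candidates.append(match.group(1))
    candidates += ["utf-8", "cp1252"]  # common fallbacks (an assumption)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes that don't fit; try the next
    # Last resort: keep what we can and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def classify(status: int, headers: Mapping[str, str], body: str) -> str:
    """Hypothetical triage into 'ok', 'retry_later', 'blocked' or 'unusable'."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if status == 429 or (status == 503 and "retry-after" in lowered):
        return "retry_later"
    if status in (401, 403, 503):
        return "blocked"  # may be a bot wall or just a broken server
    if 200 <= status < 300 and body.strip():
        return "ok"
    return "unusable"
```

A real crawler would need far more cases than this (redirect loops, truncated bodies, challenge pages served with a 200), which is the point of the comment: the triage logic is where most of the work goes.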


bomewish 815 days ago

Ah. Have so been there. But I don’t really have the resources to spin something up from scratch. Good luck!!
