by gorbog 2341 days ago
How can they crawl the entire web (or a big portion of it) every day without spending tens of thousands of dollars in server/bandwidth costs?

I know there are lots of open source crawlers like StormCrawler, but the cost of running one at web scale every day is prohibitive, isn't it?

3 comments

I built a prototype of something like this a year ago. But I ultimately decided that I couldn't ever see myself making close to as much money from it as I already do. I won't be sad if this service proves that wrong.

I just told customers to sign up for Google Alerts for the general web, and then I scraped new content from a bunch of sites like Reddit, Pinterest, Twitter, Instagram, Product Hunt, Hacker News.

The idea was to order everything by how many people might be seeing it and how many people were interacting with it (dis/liking).
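A rough sketch of that kind of ordering, not the code I actually ran. The `Post` fields, the `score` function, and the weight of 10 are all made up for illustration; the only part taken from the idea above is that reach and interactions (likes and dislikes alike) both push an item up.

```python
from dataclasses import dataclass


@dataclass
class Post:
    url: str
    views: int     # hypothetical: estimated number of people who saw it
    likes: int
    dislikes: int


def score(post: Post, interaction_weight: float = 10.0) -> float:
    # Reach plus weighted interactions. Dislikes count as engagement too,
    # so they raise the score rather than lower it. The weight is arbitrary.
    return post.views + interaction_weight * (post.likes + post.dislikes)


def rank(posts: list[Post]) -> list[Post]:
    # Highest score first.
    return sorted(posts, key=score, reverse=True)
```

In practice the weight would need tuning per site, since "views" is not comparable across Reddit, Twitter, and so on.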

Since most things posted on social media basically go unnoticed, I crawled things in a way that you would only see stuff any reasonable business would care about. Because of that, even for a website like Reddit, I could get all the content for an hour in only a couple thousand hits -- bandwidth wasn't anything crazy.
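Roughly, the filtering looked like the sketch below. This is an illustration under assumptions: the dict keys (`url`, `likes`, `dislikes`) and the threshold of 25 are hypothetical, not any site's real API. The point is just that you decide from the cheap listing data which items deserve a full fetch, and skip the rest.

```python
def worth_fetching(item: dict, min_interactions: int = 25) -> bool:
    # Treat an item as "noticed" once likes + dislikes reach the threshold.
    # Missing counts are treated as zero.
    interactions = item.get("likes", 0) + item.get("dislikes", 0)
    return interactions >= min_interactions


def select_urls(listing: list[dict], min_interactions: int = 25) -> list[str]:
    # Keep only the URLs that are worth a full request.
    return [item["url"] for item in listing if worth_fetching(item, min_interactions)]
```

For example, `select_urls([{"url": "x", "likes": 30, "dislikes": 0}, {"url": "y", "likes": 1}])` keeps only `"x"`, so the second item never costs a request.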

As a business, this gave you the ability to prioritize what you should pay attention to and where -- without giving you a bunch of noise.

Given how web hosting is set up and priced, bandwidth for crawling/scraping is basically $0 at most popular hosting companies.
I suppose they query Google.
If that's the case, just use Google Alerts.