by gorbog
2341 days ago
How can they crawl the entire web (or a big portion of it) every day without spending tens of thousands of dollars on server and bandwidth costs? I know there are lots of open source crawlers like StormCrawler, but the cost of running one at web scale every day is prohibitive, isn't it?
I just told customers to sign up for Google Alerts for the general web, and then I scraped new content from a bunch of sites like Reddit, Pinterest, Twitter, Instagram, Product Hunt, and Hacker News.
The idea was to order everything by how many people might be seeing it and how many people were interacting with it (liking or disliking it).
Since most things posted on social media basically go unnoticed, I crawled things in a way that you would only see stuff any reasonable business would care about. Because of that, even for a website like Reddit, I could get all the content for an hour in only a couple thousand hits -- bandwidth wasn't anything crazy.
For a business, this made it possible to prioritize what to pay attention to and where, without wading through a bunch of noise.
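
The ordering idea above could look roughly like the following. This is only a sketch under assumptions, not the code that was actually used: the `Post` fields, the `interaction_weight`, and the `min_interactions` cutoff are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Post:
    source: str        # site the post came from, e.g. "reddit"
    title: str
    audience: int      # hypothetical: estimated number of people who might see it
    interactions: int  # hypothetical: likes + dislikes counted together


def score(post: Post, interaction_weight: float = 10.0) -> float:
    # Assumed formula: reach plus weighted engagement.
    return post.audience + interaction_weight * post.interactions


def rank_posts(posts: list[Post], min_interactions: int = 5) -> list[Post]:
    # Drop posts that went basically unnoticed, then order the rest by score.
    noticed = [p for p in posts if p.interactions >= min_interactions]
    return sorted(noticed, key=score, reverse=True)


posts = [
    Post("reddit", "Launch thread", audience=50_000, interactions=300),
    Post("twitter", "Unnoticed mention", audience=40, interactions=0),
    Post("hackernews", "Show HN", audience=20_000, interactions=900),
]

for post in rank_posts(posts):
    print(post.source, post.title, score(post))
```

Filtering before sorting is what keeps the volume small: anything below the cutoff never needs to be fetched in detail or shown to the customer.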