Hacker News new | ask | show | jobs
by AznHisoka 3566 days ago
I've crawled over a billion pages over a stretch of 3 years or so. Crawling is the easy task and just crawling a billion pages wouldn't cost more than a few thousand a month. Add a couple more thousand for storing these pages in a search index and database.
2 comments

Do you have a company that does this ? Can you advice me about this? I like the crawling thing, I would like to know how to monetize this.

Thanks!

Would you be able to share what your stack was? and the resources it took? Thanks a lot.
Ruby and Sidekiq as the messaging queue

Postgres to store the data

Elasticsearch as a search index.

My ES cluster has around 10 nodes, 64 GB RAM, quad-core.

Postgres database cluster is 4 nodes, 1 TB, 64 GB RAM, quad-core.

800 crawler threads distributed across 10 dedicated servers.

Thanks a lot! This sounds reasonable. Did you guys look into professional services for this?
Nope. We have lots of custom needs.
Just in case you don't know common-crawl makes available a huge crawl dataset
Common Crawl is great! however, some use cases require larger crawls with a higher frequency.