Hacker News
by outpan 3566 days ago
Would you be able to share what your stack was, and the resources it took? Thanks a lot.
2 comments

Ruby, with Sidekiq as the messaging queue.

Postgres to store the data.

Elasticsearch as a search index.

My ES cluster has around 10 nodes, 64 GB RAM, quad-core.

Postgres database cluster is 4 nodes, 1 TB, 64 GB RAM, quad-core.

800 crawler threads distributed across 10 dedicated servers.
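
For a rough sense of how those numbers fit together: 800 threads over 10 servers is 80 threads per server, assuming an even split (the post does not say how they were divided). Below is a minimal sketch of one server's worth of crawler threads draining a shared queue. It is not the actual crawler; it uses only Ruby's core `Thread` and `Queue` in place of Sidekiq, and the `crawl` method, the `fetcher` lambda, and the example.com URLs are hypothetical stand-ins so it runs without network access.

```ruby
# Hypothetical sketch: a fixed pool of crawler threads draining a shared queue.
# Core Ruby only (Thread, Queue, Mutex); the setup described above used Sidekiq.

TOTAL_THREADS = 800                           # from the post
SERVERS       = 10                            # from the post
THREADS_PER_SERVER = TOTAL_THREADS / SERVERS  # 80, assuming an even split

# Runs `fetcher` on every URL using `pool_size` worker threads and
# returns a hash of url => fetch result.
def crawl(urls, pool_size:, fetcher:)
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close # once closed and empty, pop returns nil and workers exit

  results = {}
  lock = Mutex.new

  workers = Array.new(pool_size) do
    Thread.new do
      while (url = queue.pop)
        body = fetcher.call(url)
        lock.synchronize { results[url] = body } # guard the shared hash
      end
    end
  end

  workers.each(&:join)
  results
end

# Stub fetcher (assumption): replace with a real HTTP client in practice.
stub_fetcher = ->(url) { "fetched #{url}" }
urls = (1..200).map { |i| "http://example.com/page/#{i}" }

pages = crawl(urls, pool_size: THREADS_PER_SERVER, fetcher: stub_fetcher)
puts "#{pages.size} pages fetched with #{THREADS_PER_SERVER} threads"
```

With a real job queue such as Sidekiq, the in-process `Queue` would be replaced by jobs stored in Redis, which is what lets the threads be spread across separate servers.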

Thanks a lot! This sounds reasonable. Did you guys look into professional services for this?
Nope. We have lots of custom needs.
Just in case you don't know, Common Crawl makes a huge crawl dataset available.
Common Crawl is great! However, some use cases require larger crawls at a higher frequency.