by throwaway4good 2114 days ago
I am particularly curious about data storage.

Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, sitting directly on top of a file system?

1 comment

Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer, which can be very slow, for storing crawled pages (like Hadoop or Cassandra). In practice, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
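
To make the two-layer idea concrete, here is a rough Python sketch, not the actual crawler code. Everything in it is an assumption for illustration: the `CrawlStore` class and its method names are hypothetical, a plain dict stands in for the fast key-value layer (RocksDB in the setup described above), one local directory stands in for the RAID array, and a second directory stands in for the slow remote store (Hadoop/Cassandra).

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


class CrawlStore:
    """Hypothetical two-tier store: fast URL metadata, slow bulk page storage."""

    def __init__(self, local_dir: Path, remote_dir: Path):
        # Assumption: a dict stands in for the fast key-value layer.
        self.meta = {}
        # Assumption: a local directory stands in for the RAID array.
        self.local_dir = local_dir
        # Assumption: another directory stands in for the remote store.
        self.remote_dir = remote_dir
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self.remote_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key_for(url: str) -> str:
        # Hash the URL so it is safe to use as a file name.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def save_page(self, url: str, body: bytes) -> None:
        # Write the page body locally first, then record metadata.
        key = self.key_for(url)
        (self.local_dir / key).write_bytes(body)
        self.meta[url] = {
            "key": key,
            "fetched_at": time.time(),
            "size": len(body),
            "synced": False,
        }

    def sync(self) -> int:
        # Separate periodic step: move unsynced pages to the remote store.
        moved = 0
        for record in self.meta.values():
            if record["synced"]:
                continue
            src = self.local_dir / record["key"]
            shutil.copy2(src, self.remote_dir / record["key"])
            src.unlink()
            record["synced"] = True
            moved += 1
        return moved

    def load_page(self, url: str):
        # Metadata lookup is cheap; the body comes from whichever tier holds it.
        record = self.meta.get(url)
        if record is None:
            return None
        base = self.remote_dir if record["synced"] else self.local_dir
        return (base / record["key"]).read_bytes()


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    store = CrawlStore(root / "local", root / "remote")
    store.save_page("https://example.com/", b"<html>hello</html>")
    print("synced pages:", store.sync())
```

The point of the split is that the crawl loop only ever touches the fast metadata layer and local disk, so a slow or distant bulk store never blocks fetching. A real version would persist the metadata and batch the sync, which this sketch skips.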