Hacker News new | ask | show | jobs
by techscruggs 5254 days ago
They said that they had 2TB in postgres not 2TB of total data. I imagine all of their data is probably about an order of magnitude larger. Additionally, the challenges are not as much around how much data you have, but how you want to access that data (indexes).
1 comments

Indeed. It boils down to their need for a durable cache. It's simply too expensive to try to cache every comment tree in RAM, and Cassandra's data model and disk storage layout is a really good fit for the structure of their data.
You don't need every comment tree in RAM just the last few days worth plus a few older threads that get linked back to. They are currently using 200 machines so let's say 10 of them are used to cache 1 weeks comments. 30 GB of ram * 10 machines = 300GB of cache. I would be vary surprised if they generate 200GB/week or 10TB of comment data a year.

Edit: For comparison Slashdot spent a long time on just 6 less powerful machines vs the 200+ Reddit is using. Reddit may have more traffic, but not 40x as much. And, last I heard HN just uses one machine.

PS: The average comment is small and they can compress most comments after a day or so. They can probably get away with storing a second copy of most old threads as a blob of data in case people actually open it which cost a little space, but cuts down on processing time.

Please. Reddit does 2 billion page views per month.
Yea, and Casandra was built for a company serving 500 times that.