|
Good question! And few people get to work at this scale, so it's not an unreasonable guess. I'll join you in speculating wildly about this, since, hey, it's kind of fun. IMHO sharding traffic by subreddit doesn't pass the smell-test, though. Different subreddits have very different traffic patterns, so the system would likely end up with hotspots continuously popping up, and it'd probably be toilsome to constantly babysit hot shards and rebalance. (Consider some of the more academic subreddits vs. some of the more meme-driven subreddits — and then consider what happens when e.g. a particular subreddit takes off, or goes cold.) Sharding on a dimension that has a more random/uniform distribution is usually the way to go. Off the top of my head (still speculating wildly and basically doing a 5-minute version of the system-design question just for fun), I'd be curious to shard by a hash of the post ID, or something like that. The trick is always to have a hashing algorithm that's stable when it's time to grow the number of shards (otherwise you're kicking off a whole re-balancing every time), and of course I'm too lazy to sort that out in this comment. I vaguely remember the Instagram team had a really cool sharding approach that they blogged about in this vein. (This would've been pre-acquisition, so ancient history by Silicon Valley standards.) As for subreddit metadata (public, private, whatever), I'd really expect all of that to be in a global cache at this point. It's read-often/write-rarely data, and close-to-realtime cache-invalidation when it does change is a straightforward and solved problem. |