by InclinedPlane 5088 days ago
Why does content size matter? The challenge is that every single page of content, except for each individual tweet, is utterly unique for every user. That defeats the vast majority of straightforward caching implementations. You can't ever cache fully rendered pages, because the chance that one random timeline view at a given time will be identical to any other view (even by the same person at a different time) is pretty much as close to zero as possible. Every view is dynamically generated content from up to several hundred or thousand different streams of data, and it needs to be put in order and have all of the per-user metadata set correctly.
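To make the "merge many streams, then set per-user metadata" step concrete, here is a minimal sketch in Python. It is not how Twitter actually builds timelines; the `Tweet` type, `build_timeline`, and the `viewer_favorites` set are all hypothetical names made up for illustration, and it assumes each followed account's stream is already sorted newest first.

```python
import heapq
from itertools import islice
from typing import AbstractSet, Dict, Iterable, List, NamedTuple


class Tweet(NamedTuple):
    tweet_id: int
    timestamp: int  # larger means newer (hypothetical field)
    author: str
    text: str


def build_timeline(
    streams: Iterable[Iterable[Tweet]],
    viewer_favorites: AbstractSet[int],
    limit: int,
) -> List[Dict[str, object]]:
    """Merge per-author streams (each sorted newest first) into one view.

    The result depends on both the set of streams and the viewer's own
    state, which is why a fully rendered page is hard to share between users.
    """
    # k-way merge; reverse=True expects every input sorted largest-first.
    newest_first = heapq.merge(*streams, key=lambda t: t.timestamp, reverse=True)
    return [
        # Per-viewer metadata is attached at read time, not stored on the tweet.
        {"tweet": t, "favorited": t.tweet_id in viewer_favorites}
        for t in islice(newest_first, limit)
    ]


# Tiny usage example with made-up data.
alice = [Tweet(3, 300, "alice", "c"), Tweet(1, 100, "alice", "a")]
bob = [Tweet(2, 200, "bob", "b")]
timeline = build_timeline([alice, bob], viewer_favorites={2}, limit=2)
```

Two viewers who follow the same accounts still get different results as soon as their `viewer_favorites` differ, and the same viewer gets a different result as soon as any stream gains a tweet.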

Once you start looking into the actual mathematical constraints of the problem of Twitter, you realize that it's a scaling nightmare: hundreds of millions of updates per day and tens of thousands of views per second (billions per day). There are only a few people in the world who have the right to look down on stats like that.


Again, as the parent poster also said, I think you have never worked on large data. Twitter is like a big mailbox, except that every mail is only 160 bytes. This was solved 10 years ago.
If you don't understand that the request distribution matters more than payload size, you aren't even seeing the problems.

I encourage you to analyze infrastructure for a twitter style app using inbox duplication. Once you model this against hardware costs you'll learn something about how utterly expensive write amplification is in a hot data set that must be backed by ram due to availability requirements.
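As a rough starting point for that exercise, here is a back-of-the-envelope model in Python. Every input is an assumption for illustration only: 200 million tweets per day stands in for "hundreds of millions", and the average follower count, bytes per inbox entry, and replica count are invented numbers, not Twitter's.

```python
def inbox_fanout_cost(
    tweets_per_day: int,
    avg_followers: int,
    bytes_per_inbox_entry: int,
    replicas: int,
) -> dict:
    """Estimate write amplification when each tweet is copied to every follower's inbox."""
    # One logical write (the tweet) becomes one inbox write per follower.
    inbox_writes_per_day = tweets_per_day * avg_followers
    # Each inbox entry is held in RAM on every replica.
    ram_bytes_per_day = inbox_writes_per_day * bytes_per_inbox_entry * replicas
    return {
        "amplification": inbox_writes_per_day // tweets_per_day,
        "inbox_writes_per_day": inbox_writes_per_day,
        "ram_bytes_per_day": ram_bytes_per_day,
    }


# All four inputs below are hypothetical.
cost = inbox_fanout_cost(
    tweets_per_day=200_000_000,
    avg_followers=100,
    bytes_per_inbox_entry=8,  # assumes the inbox stores only a tweet id
    replicas=3,
)
```

With these made-up inputs, 200 million tweets turn into 20 billion inbox writes per day, and even 8-byte entries add up to 480 GB of new RAM-resident data per day across replicas. Plug in your own numbers and hardware prices; the point is that the multiplier is the follower count, not the payload size.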

Wait, see my comment below. Twitter received 15B (yes, B) API calls/day last July. How does that compare to your typical email client?

I don't want to argue that Twitter is astoundingly hard, but serving ~170K requests/sec can't really be that trivial, even if they're 160 bytes (they're not, since Twitter sends metadata, logs those messages, tracks service metrics, etc. for those messages).
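For anyone checking the arithmetic, here is the conversion in Python. The 15B calls/day and 160-byte figures come from this thread; treating load as perfectly uniform over the day is a simplifying assumption, so real peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

api_calls_per_day = 15_000_000_000  # figure quoted above
payload_bytes = 160                 # the "every mail" size claimed upthread

# Average request rate, assuming uniform load across the day.
requests_per_second = api_calls_per_day / SECONDS_PER_DAY

# Raw payload bandwidth if every call carried only 160 bytes.
payload_megabytes_per_second = requests_per_second * payload_bytes / 1_000_000

print(f"{requests_per_second:,.0f} requests/sec")
print(f"{payload_megabytes_per_second:.1f} MB/sec of 160-byte payloads")
```

That works out to roughly 173,600 requests/sec but only about 28 MB/sec of bare payload, which is the point: the request rate is the hard part, not the bytes.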

If you treat twitter like a big mailbox, things will work "ok". It's not the worst approach ever, that's for certain. But end-user perceptible performance would be a fraction of what twitter has today.

P.S. How many images does twitter serve up per day at present? That's a tad more than 160 characters of data.

Another key difference is that email users generally contribute directly to their provider's infrastructure costs in providing email as a service. Email infrastructure (and the user experience) is fragmented, and global funding generally scales with global load.

Instead twitter must monetize via advertising of some form, and so the percentage of folks who do not respond to ads acts as a really strong factor in your cost calculations. In this sense, email software has it easy, and can be extremely wasteful in the resources it consumes.

It's not just that the availability expectations of twitter are higher than email, it's also that the economic base of the infrastructure is far more sparse.

And yet another key difference is that there are few email server installations that support half a billion users. Saying that the scaling problem is "solved" because all you have to do is copy, say, gmail, is kind of silly.