by timaelliott 5088 days ago
Twitter's attitude is laughable. Yes, you have a decent amount of traffic but you're also only dealing with ~160 uncompressed bytes (plus whatever overhead) per event. The hurdles you've overcome aren't particularly amazing or challenging.

Your assumptions are laughable. Tweets have considerable metadata that pushes them far beyond 160 bytes. Fragmenting this into a secondary object is counterproductive due to the constant factor of 2 requests vs one larger payload.

Someone who's been around the block a few times understands that it's difficult to make pronouncements without informed observation. That you are not willing to extend twitter's engineering staff the benefit of the doubt considering your lack of visibility into their measurements speaks loudly.

What does content size matter? The challenge is that every single page of content except for each individual tweet is utterly unique for every user. That defeats the vast majority of straightforward caching implementations. You can't cache fully rendered pages ever because the chance that one random timeline view at a given time will be identical to any other view (even by the same person at a different time) is pretty much as close to zero as possible. Every view is dynamically generated content from up to several hundred or thousand different streams of data and needs to be put in order and have all of the per-user metadata set correctly.
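A rough sketch of the "put in order" step described above, assuming each followed account exposes its own newest-first stream of (timestamp, author, text) tuples. The function name build_timeline and the sample data are hypothetical, and this is not a claim about how Twitter actually assembles timelines; it only illustrates why every view is a per-user merge rather than a cacheable page.

```python
import heapq
from itertools import islice

def build_timeline(followed_streams, limit=20):
    """Merge per-author streams (each sorted newest-first) into one newest-first timeline."""
    # heapq.merge is a lazy k-way merge; reverse=True expects inputs sorted descending.
    merged = heapq.merge(*followed_streams, key=lambda tweet: tweet[0], reverse=True)
    return list(islice(merged, limit))

# Hypothetical sample data: (timestamp, author, text), newest first within each stream.
streams = [
    [(105, "alice", "a2"), (101, "alice", "a1")],
    [(104, "bob", "b1")],
    [(106, "carol", "c2"), (102, "carol", "c1")],
]

timeline = build_timeline(streams, limit=4)
for timestamp, author, text in timeline:
    print(timestamp, author, text)
```

Two users who follow even slightly different sets of accounts get different inputs to this merge, which is why the rendered result can't be shared between them.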

Once you start looking into the actual mathematical constraints of the problem of twitter you realize that it's a scaling nightmare. Hundreds of millions of updates per day and tens of thousands of views per second (billions per day). There are only a few people in the world who have the right to look down on stats like that.

Again, as the parent poster also posted, I think you have never worked on large data. Twitter is like a big mailbox, except that every mail has only 160 bytes. This was solved 10 years ago.
If you don't understand that the request distribution matters more than payload size, you aren't even seeing the problems.

I encourage you to analyze infrastructure for a twitter style app using inbox duplication. Once you model this against hardware costs you'll learn something about how utterly expensive write amplification is in a hot data set that must be backed by ram due to availability requirements.
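A back-of-the-envelope model of that write amplification, as a sketch only. The 400M tweets/day figure is the one quoted elsewhere in this thread; the average follower count and the bytes per inbox entry are made-up assumptions, and fanout_on_write is a hypothetical helper, not anyone's real API.

```python
def fanout_on_write(tweets_per_day, avg_followers, bytes_per_entry):
    """Return (inbox writes per day, bytes written per day) for inbox duplication."""
    # Every tweet is copied once into each follower's inbox.
    inbox_writes = tweets_per_day * avg_followers
    return inbox_writes, inbox_writes * bytes_per_entry

TWEETS_PER_DAY = 400_000_000  # figure quoted in this thread
AVG_FOLLOWERS = 100           # assumption, for illustration only
BYTES_PER_ENTRY = 16          # assumption: a tweet id plus a timestamp, no payload

writes, bytes_written = fanout_on_write(TWEETS_PER_DAY, AVG_FOLLOWERS, BYTES_PER_ENTRY)
print(f"{writes:,} inbox writes/day")
print(f"{bytes_written / 1e9:,.0f} GB/day of inbox entries")
```

Under these assumptions the 400M logical writes become 40 billion physical writes per day, before replication, and all of it lands in the hot, RAM-backed data set.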

Wait, see my comment below. Twitter received 15B (yes, B) API calls/day last July. How does that compare to your typical email client?

I don't want to argue that Twitter is astoundingly hard, but serving ~170K requests/sec can't really be that trivial, even if they're 160 bytes (they're not, since Twitter sends metadata, logs those messages, tracks service metrics, etc. for those messages).
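For anyone checking the ~170K figure, it is just the 15B API calls/day quoted above spread evenly over a day. This is a daily average; the real peak rate would presumably be higher, but no peak number is given in this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

api_calls_per_day = 15_000_000_000  # figure quoted in this thread
average_rps = api_calls_per_day / SECONDS_PER_DAY

print(f"{average_rps:,.0f} requests/sec on average")  # about 173,611
```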

If you treat twitter like a big mailbox, things will work "ok". It's not the worst approach ever, that's for certain. But end-user perceptible performance would be a fraction of what twitter has today.

P.S. How many images does twitter serve up per day at present? That's a tad more than 160 characters of data.

Another key difference is that email users generally contribute directly to their provider's infrastructure costs in providing email as a service. Email infrastructure (and the user experience) is fragmented, and global funding generally scales with global load.

Instead twitter must monetize via advertising of some form, and so the percentage of folks who do not respond to ads acts as a really strong factor in your cost calculations. In this sense, email software has it easy, and can be extremely wasteful in the resources it consumes.

It's not just that the availability expectations of twitter are higher than email, it's also that the economic base of the infrastructure is far more sparse.

And yet another key difference is that there are few email server installations that support half a billion users. Saying that the scaling problem is "solved" because all you have to do is copy, say, gmail, is kind of silly.
Yes, you have a decent amount of traffic but you're also only dealing with ~160 uncompressed bytes (plus whatever overhead) per event. The hurdles you've overcome aren't particularly amazing or challenging.

>42M uniques last month.[0] Are you really going to assert Twitter hasn't dealt with amazing or challenging hurdles in getting this far?

[0] http://siteanalytics.compete.com/twitter.com/

EDIT: this ignores that twitter.com is not the only Twitter client--they served 15B (!!) requests/day (!!!) as of a year ago.

Not to mention metadata, instrumentation for services, logging, DB backups, and managing configuration of all of those distributed resources. Are we still talking about the ease of 160 bytes?

http://www.readwriteweb.com/hack/2011/07/twitter-serves-more...

Yes.

In 2008/2009, another engineer and I built an ad platform that received around 500M impressions per day, 5M clicks per day. And it wasn't just recording a tweet or publishing out to followers. We took the user input query, had to do some keyword/relevancy targeting, geofiltering, matching to advertisers and deliver back a large result set of adverts. All within 100ms.

Our platform was also apache, mod_php, memcached, mysql and rabbitmq. So definitely not the most optimal of platforms by any means. We had two colos with ~20 servers (dell r410s) at each facility.

Twitter just recently announced 400M tweets/day. I'm not trying to brag about my experiences, because looking back now we made numerous amateur mistakes, but just showing that Twitter's "scale" is a joke compared to everyday challenges at any large internet ad network.

You understand that 400M tweets a day is the number of tweets posted to their system, right? That speaks not at all to the consumption of those tweets, which is the metric you're using for your ad platform.

Additionally, they don't just deal with 160 characters, because again, somehow you're still talking about data being posted, and not data being consumed. Data is consumed off their site via polling APIs, streaming APIs, and a website, all of which are pushing those 400M tweets a day out to plenty of consumers.

They may not have as ridiculous a scale as they act like they do. But let's be clear: it is nowhere near as trivial as you make it out to be, either. Armchair quarterbacking is always easy, because you aren't exposed to the complexity that arises when you've spent a few months and years hitting the corner cases of the problem you're commenting on.

So you had 500m reads on a relatively static data set + 5m writes on an unrelated log? Sounds like a fun problem, but I agree it doesn't sound like rocket science. On the other hand, it also doesn't sound like Twitter, having 400m writes per day, and 400*x million reads on that very dynamic data set. Just seems that's a slightly harder problem.
Adserving is not really static. Cachebusters are named so for good reason. Nowadays ad server developers are clever enough to separate click tracking and impression tracking (the non-Enterprise version of OpenX still deserves a lot of ಠ_ಠ though).

In an RTB environment, there is an additional constraint of having to serve up your ad (or decision) within 60ms (Google ADX sets a hard limit of 80ms), and the fastest best bid wins.

I don't think that's a less hard problem compared to Twitter, especially at high volumes. You can't just say "scale sideward!".

That said, the first link was totally misleading. I was actually quite shocked to see that Twitter only had 42M uniques per month, because a typical ad network does a lot more.

EDIT: ah.. 15B requests/day makes more sense. Wtf is with the wrong stats?

Are you talking about 15B vs. the visits chart I linked? If so, the 15B number comes from API calls, which do not have to happen through the website (think of all the Twitter clients).
Requests are requests. 15B is a gigantic amount.
See my edit. Your ad impressions reached approximately 3% of Twitter's daily request load last year. Note that those requests can serve up to 200 tweets + metadata.
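The 3% comes straight from the two numbers in this thread: 500M ad impressions/day against 15B API calls/day. The last line is only an upper bound on tweets served, assuming every request returned the maximum of 200 tweets, which is surely not the case in practice.

```python
ad_impressions_per_day = 500_000_000        # ad platform figure from this thread
twitter_api_calls_per_day = 15_000_000_000  # Twitter figure from this thread
MAX_TWEETS_PER_REQUEST = 200                # "up to 200 tweets" per request

share = ad_impressions_per_day / twitter_api_calls_per_day
upper_bound_tweets_served = twitter_api_calls_per_day * MAX_TWEETS_PER_REQUEST

print(f"Ad impressions as a share of API calls: {share:.1%}")
print(f"Upper bound on tweets served per day: {upper_bound_tweets_served:,}")
```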

This doesn't account for Twitter's budding ad service, which one can assume has some of the same functionality (targeted advertising, information retrieval) as traditional ad networks.

You are off by nearly 2 orders of magnitude from twitter's scale. They have billions of views per day and each of those views is a stream comprised of hundreds of different sub-streams.
Add to that, the challenges of sub-60ms RTB. All the fun!
Decent amount of traffic?

Sorry but the only thing laughable is that comment.