by hnexamazon 272 days ago
I was an SDE on the S3 Index team 10 years ago, but I doubt much of the core stack has changed.

S3 is composed primarily of layers of Java-based web services. The hot path (object get / put / list) are all served by synchronous API servers - no queues or workers. It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career.

For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
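
To make the get path above concrete, here is a minimal sketch of the three layers as in-memory Python objects. Every name in it (`IndexFleet`, `StorageFleet`, `FrontEnd`, `get_object`) is hypothetical and chosen for illustration; it is not S3's actual code or API, and it leaves out load balancing, partitioning, and the internal protocol entirely.

```python
from dataclasses import dataclass, field


@dataclass
class IndexFleet:
    """Hypothetical indexing layer: maps (bucket, key name) to a storage id."""
    mapping: dict = field(default_factory=dict)

    def lookup(self, bucket, key):
        # Returns None when the key is unknown.
        return self.mapping.get((bucket, key))


@dataclass
class StorageFleet:
    """Hypothetical storage layer: maps a storage id to the actual bytes."""
    blobs: dict = field(default_factory=dict)

    def read(self, storage_id):
        return self.blobs[storage_id]


class FrontEnd:
    """Hypothetical front-end API server: one synchronous call per layer."""

    def __init__(self, index, storage):
        self.index = index
        self.storage = storage

    def get_object(self, bucket, key):
        # Step 1: ask the index for the internal storage id.
        storage_id = self.index.lookup(bucket, key)
        if storage_id is None:
            return 404, b""
        # Step 2: fetch the bits from storage with that id.
        return 200, self.storage.read(storage_id)


index = IndexFleet({("photos", "2024/cat.jpg"): "blob-001"})
storage = StorageFleet({"blob-001": b"cat bytes"})
front_end = FrontEnd(index, storage)
print(front_end.get_object("photos", "2024/cat.jpg"))
print(front_end.get_object("photos", "missing.jpg"))
```

The point of the sketch is only the shape of the request: the front end blocks on the index lookup, then blocks on the storage read, with no queue in between.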

The only novel bit is that all the backend communication uses a home-grown, stripped-down HTTP variant, called STUMPY if I recall. It was a dumb idea not to just use HTTP, but the service is ancient and was originally built back when principal engineers were allowed to YOLO their own frameworks and protocols, so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.

5 comments

Rest assured, STUMPY was replaced with another home-grown protocol! Though I think a stream-oriented protocol is a better match for large-scale services like S3 storage than a synchronous protocol like HTTP.

> Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently.

They may still use key names for partitioning. But they now randomly hash the user key name prefix on the back end to handle hotspots generated by similar keys.
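
A rough sketch of why hashing helps with hotspots, under my own assumptions: `NUM_PARTITIONS`, the 4-character prefix, the choice of SHA-256, and hashing the whole key are all made up for illustration and are not how S3 actually does it. It also ignores how an ordered LIST would work over hashed partitions.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count


def _bucket(text):
    """Map a string to a partition number via a stable hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def partition_by_prefix(key, prefix_len=4):
    """Naive scheme: only the leading characters decide the partition."""
    return _bucket(key[:prefix_len])


def partition_by_hash(key):
    """Hashed scheme: the whole key decides the partition."""
    return _bucket(key)


# Similar keys, e.g. log files written under one date prefix.
keys = [f"logs/2024-01-01/{i:06d}.gz" for i in range(1000)]

naive = {partition_by_prefix(k) for k in keys}
hashed = {partition_by_hash(k) for k in keys}
print("partitions used by prefix scheme:", len(naive))
print("partitions used by hashed scheme:", len(hashed))
```

With the prefix scheme every key starts with `logs`, so all 1000 land on a single partition; with the hashed scheme they spread across partitions.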

> The hot path (... list) are all served by synchronous API servers

Wait; how does that work, when a user is PUTting tons of objects concurrently into a bucket, and then LISTing the bucket during that? If the PUTs are all hitting different indexing-cluster nodes, then...?

(Or do you mean that there are queues/workers, but only outside the hot path; with hot-path requests emitting events that then get chewed through async to do things like cross-shard bucket metadata replication?)

LIST is dog slow, and everyone expects it to be. (My research group did a prototype of an ultra-high-speed S3-compatible system, and it really helps not needing to list things quickly.)

It's not all Java anymore. There's some Rust now, too. ShardStore, at least (which the article mentions).

"It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career."

Can you give some numbers, or at least a ballpark?

Tens of thousands of TPS per node.