Hacker News new | ask | show | jobs
by pentlander 2035 days ago
The hand rolled gossip protocol (DFDD) is not used for consensus, it's just used for cluster membership and health information. It's used by pretty much every foundational AWS service. There's a separate internal service for consensus that uses paxos.

The thread per frontend member definitely sounds like a problematic early design choice. It wouldn't be the first time I heard of an AWS issue due to "too many threads". Unlike gRPC, the internal RPC framework defaults to a thread per request rather than an async model. The async way was pretty painful and error prone.

3 comments

Are they still using Coral and Codigo as the RPC stack?
Yeah Amazon still runs on Coral, there were some recent (release a few years ago) advances on it under the hood and ergonomically. I think the "replacement" for it is Smithy[0] though it will likely just replace the XML schema and codegen and not the protocol. Honestly at this point I think it would be in Amazon's best interest to heavily invest in Java Project Loom rather than trying to convert to async.

[0] https://awslabs.github.io/smithy/

Not Codigo, but Coral, yes.
Well, there's still codigo around, but coral is quite pleasent
Yep, that was my understanding as well. It doesn't seem to be for consensus.

Although, for Frontend servers which just do auth, routing, etc - why is P2P gossip necessary for building shard map? Possibly because retrieval of configuration information directly from the vending service may be a bottleneck - but then why not gossip with a subset of peers than every peer and the vending service which is a source of truth.

Are you sure Kinesis uses DFDD [0]?

[0] Seems like a relic of years gone by https://patents.justia.com/patent/9838240

Though I no longer work for Amazon, I'm reasonably certain they use it from the description. Especially given I know for a fact that other more foundational services use it.

Why is it a "relic of years gone by"? Consul uses a similar, though more advanced technique[0]. Consul may not be as widely used as etcd, but I don't think most would consider it a relic.

[0] https://www.consul.io/docs/architecture/gossip

That patent is from when Kinesis Data Streams were originally announced to the public. Any reason not to think it uses it. Seems like it would have been a logical choice in the initial architecture and change is slow.