by tptacek 237 days ago

Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.

The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that and then informing every proxy in the world.

I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.

2 comments

__turbobrew__ 237 days ago

> The interesting problem is noticing that and then informing every proxy in the world.

Yes, and that is why I suggested that your any-to-any relationship of proxy to application is a decision you made, and part of the pain point that led you to this solution. The fact that any proxy box can proxy to any backend is a choice that created the structure and mental model you are working within. You could batch your proxies into, say, 1024 cells and then assign a customer app to, say, 4 of those 1024 cells using shuffle sharding. That decomposes the problem into maintaining state within a cell instead of globally.
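To make the cell-assignment idea concrete, here is a minimal sketch of shuffle sharding in Python. It only illustrates the suggestion in the comment above and is not how Fly.io's proxies work. The function name `cells_for_app`, the app ID string, and the choice of SHA-256 to seed the selection are all assumptions made for this example; only the 1024 and 4 come from the comment.

```python
import hashlib
import random

NUM_CELLS = 1024      # total proxy cells, from the comment above
CELLS_PER_APP = 4     # cells assigned to each app, from the comment above


def cells_for_app(app_id: str,
                  num_cells: int = NUM_CELLS,
                  k: int = CELLS_PER_APP) -> list[int]:
    """Pick k distinct cells for an app, derived from its ID.

    Hypothetical helper: the app ID is hashed into a seed so that every
    caller computes the same assignment without shared state.
    """
    digest = hashlib.sha256(app_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # private RNG, does not touch global state
    # sample() draws without replacement, so the k cells are distinct.
    return sorted(rng.sample(range(num_cells), k))


if __name__ == "__main__":
    for app in ("example-app-a", "example-app-b"):
        print(app, cells_for_app(app))
```

Under this scheme, health state for an app only has to reach the proxies in its 4 cells, and two apps rarely share all 4 cells, which limits how far one bad app's problems spread. The trade-off raised in the replies still applies: an edge proxy outside those cells cannot serve the app.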

I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.

tptacek 237 days ago

Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.

We cannot in fact assign our customers' apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.

__turbobrew__ 237 days ago

I think the impedance mismatch here is that I am assuming we are talking about a hyperscaler cloud, where it would be reasonable to have, say, 1024 proxies per region. Each app would be assigned to 4 of the 1024 proxies in each region.

I have no idea how big fly.io's compute footprint is, and maybe because of that the design I am suggesting makes no sense for you.

tptacek 237 days ago

The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle sharding.

otterley 237 days ago

Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)
