Hacker News new | ask | show | jobs
by jashmatthews 2115 days ago
Dealing with latency variability is a Hard Problemâ„¢ and really not much to do with Ruby or process vs thread parallelism.

https://dl.acm.org/doi/pdf/10.1145/2408776.2408794

1 comments

Well, having more spare ruby processes / threads would make the app more resistant to latency variability, and could have made some incidents into nonevents.

Also, while I don't disagree that it is indeed a hard problem, I do have very good experience with an async java stack, where I didn't have to worry about things like this. As long as a sane queue limit is defined on let's say the jetty http client, if something bad happens at the other end, the back pressure would kick in by failing immediately the requests that couldn't make it into the queue. Other parts of the app would then continue to be functional.

So, I would contend that it has a lot to do with ruby high memory usage, made much worse when single-threaded, and it looks like ruby 3.0 still won't have a complete async story yet?

EDIT: I checked the link again, and it looks Jeff Dean was talking about latency at p999 or above? By "hiccup", I actually mean something that would increase avg latency by perhaps 5~10x times, e.g. avg latency of 100ms under steady state + timeout of 1 second + the remote being down. Sorry for the confusion. Here, I am lucky if people start caring about p95.

That's not an inherent property of a particular language or concurrency model, though. That's having logic to track request queue depth for a particular service or endpoint and fail fast/load shed. You can do the same in Ruby! Some would probably say this is what a service mesh is for.

Maybe you're thinking of the new Actor based model for compute parallelism? Async IO in production Ruby has been a thing for easily more than a decade.

Of course it is not an inherent property of a particular language or concurrency model, but it is a property of a particular language ecosystem. As a turing complete language, everything is doable in ruby, but at what cost? Now we are back to trade-offs I listed above.

As for async IO in production, looking at the client library, https://github.com/socketry/async-http is barely 3 years old, and probably reached the production-ready state a few months ago, if we are being generous.

But good point about service mesh. Moving the circuit breaker responsibility to the service mesh would definitely help in my case, as the sidecar would have all the data points from the 10+ single-threaded ruby processes running in the same pod, and thus could make a much quicker decision.

https://github.com/socketry/async-http is just a new client library. PostRank, Zynga, CloudFoundry were all running async IO in production on Ruby ~10 years ago. CRuby support for non-blocking IO dates back to the mid 2000s. https://github.com/eventmachine/eventmachine actually dates back even further

If you're using Unicorn then you've already got Raindrops which gives you a really simple way to do shared metrics across forked processes like in-flight requests to another service or how many of your Unicorns are busy.

EventMachine has been losing steam for awhile now, which is why I brought up Async as the new hotness. I don't think it is fair to classify async-http as "just a new client library". As of now, in the ruby ecosystem, the Async framework is the only player in town. From my perspective, it still looks pretty much unproven, but perhaps we just live inside different bubbles.

It kinda feel like we are talking past each other here. I would just like to clarify that I inherited all these different ruby apps, and I don't have the magical ability to go back in time and say "Hey, perhaps we should use an async framework from the beginning" or "Dude, enough with the monkey-patching". And even if I do, those could be bad advice, as the ruby apps are making money in production.

Anyway, thanks for the suggestion to share metrics across processes. That will definitely help with the circuit breaker decision making in my case.