| HN Mirror


by freshbob 2290 days ago

The author seems to think completely in terms of "wasted utilization" when it comes to timeouts. I think they are missing the point of the timeouts and the retry logic to begin with. The effort by the circuit breaker isn't wasted, because it is exactly trying to establish whether a resource is responding or not, taking occasional network hiccups into account. If every effort past the initial timeouts was wasted, then why implement this logic to begin with? I agree with derefr (https://news.ycombinator.com/item?id=22546241) in the sense that it seems illogical to increase latency for users simply to check for availability of a timed-out resource.

IMHO the worst-case assumption of all service instances failing simultaneously leads the author astray in their quest to reduce "wasted utilization".

Pretend the network switch rebooted and all services were unavailable for a short period of time, but your website is in high demand, so the error threshold of three errors per resource was quickly reached. Let's pretend the network switch needed 5 seconds to reboot, so 42 resources each failing 3 times in that time equals 126 requests/5 seconds, or 25.2 requests/second. Now, instead of quickly recovering from that state after two seconds, the author advises waiting 30 seconds, so that's 756 requests (because your site is so popular) before the first service is retried. Then an additional 41 requests (~1.63 seconds) until all resources are marked available again. So now you have made about one thousand people unhappy, in case it's their browsing session that's constantly lost. Unless of course you were too optimistic when setting the half_open_resource_timeout, because then your services might be blocked for multiples of error_timeout, e.g. minutes with a high error_timeout value of 30 seconds. That's a lot more than a thousand people unable to log in.
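For anyone who wants to check the arithmetic, here is a rough sketch in Python. The figures are the ones from the paragraph above; the variable names are only illustrative, and it assumes the request rate stays constant for the whole period.

```python
# Back-of-the-envelope check of the outage scenario described above.
# All names are illustrative; the request rate is assumed to stay constant.
resources = 42          # service instances behind the circuit breaker
error_threshold = 3     # errors per resource before its circuit opens
outage_seconds = 5      # time the network switch needs to reboot
error_timeout = 30      # seconds an open circuit waits before a retry

failed_during_outage = resources * error_threshold        # 126 requests
request_rate = failed_during_outage / outage_seconds      # 25.2 requests/second
failed_while_open = request_rate * error_timeout          # about 756 requests
remaining_probes = resources - 1                          # 41 more requests
probe_seconds = remaining_probes / request_rate           # roughly 1.6 seconds

total_affected = failed_during_outage + failed_while_open + remaining_probes

print(f"rate: {request_rate:.1f} req/s")
print(f"failed while circuits are open: {failed_while_open:.0f}")
print(f"time to probe the remaining resources: {probe_seconds:.2f} s")
print(f"total affected requests: {total_affected:.0f}")
```

The total comes out a bit under one thousand, which is where the "about one thousand people" estimate comes from.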

IMHO setting the half_open_resource_timeout way lower than the regular service_timeout value will just risk the services _never_ becoming available again after an internal network outage in your data center. That seems like a recipe for disaster.
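A toy model of that failure mode, as a hedged sketch: it is not the code of any particular circuit-breaker library. The function `time_until_closed` and the fixed post-outage latency are assumptions made purely for illustration. It assumes the recovered service answers with a constant latency, that one probe is sent whenever `error_timeout` has elapsed, and that a timed-out probe reopens the circuit.

```python
from typing import Optional


def time_until_closed(half_open_timeout: float,
                      recovered_latency: float,
                      error_timeout: float = 30.0,
                      horizon: float = 600.0) -> Optional[float]:
    """Return the time (seconds after the circuit opened) at which it closes,
    or None if no probe succeeds within `horizon` seconds.

    Hypothetical model: the service is healthy again but answers with a
    constant `recovered_latency`.
    """
    t = error_timeout  # the first half-open probe is sent after error_timeout
    while t <= horizon:
        if recovered_latency <= half_open_timeout:
            return t + recovered_latency  # probe answered in time, circuit closes
        # Probe timed out: the circuit reopens and waits error_timeout again.
        t += half_open_timeout + error_timeout
    return None


# Made-up numbers: after the outage the service answers in 0.4 s.
# With a probe timeout at the regular 0.5 s, the first probe succeeds.
print(time_until_closed(half_open_timeout=0.5, recovered_latency=0.4))
# With a much stricter 0.1 s probe timeout, every probe fails.
print(time_until_closed(half_open_timeout=0.1, recovered_latency=0.4))
```

In this model the second call returns `None`: as long as the real latency sits between the two timeouts, the service is perfectly usable at the regular timeout but is never marked available again.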