Which form of jitter is better: adding a random wait time to a predetermined wait time that grows exponentially with each retry attempt, or following something like [0] where every retry attempt increases the possible wait time choices and the "jitter" is to randomly pick one of them?
To illustrate the latter option, suppose the smallest retry time unit is 1 second. The first attempt gives you a random choice in {0, 1}. The second attempt gives you a random choice in {0, 1, 2}. The third attempt gives you a random choice in {0, 1, 2, 4}. The fourth attempt gives you a random choice in {0, 1, 2, 4, 8}. This goes on until a ceiling in the number of attempts or a set wall clock time is reached.
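To make the comparison concrete, here is a rough Python sketch of both options. The names `additive_jitter_wait` and `choice_set_wait`, the 60-second cap, and the size of the random extra in option 1 (up to one time unit) are assumptions for illustration, not taken from [0].

```python
import random

BASE = 1.0   # smallest retry time unit in seconds (from the example above)
CAP = 60.0   # assumed ceiling on any single wait

def additive_jitter_wait(attempt, rng=random):
    """Option 1: fixed exponential wait plus a random extra (assumed: up to one BASE)."""
    fixed = min(CAP, BASE * 2 ** (attempt - 1))
    return fixed + rng.uniform(0, BASE)

def choice_set_wait(attempt, rng=random):
    """Option 2: pick uniformly from {0, 1, 2, 4, ...}, one more choice per attempt."""
    choices = [0.0] + [BASE * 2 ** k for k in range(attempt)]
    return min(CAP, rng.choice(choices))
```

Option 1 never waits less than its fixed part, while option 2 can always draw 0, which is the main practical difference between the two.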
It depends on what you want. The first option will tend towards longer waits, and the second towards shorter waits.
I would tend towards increasing the minimum wait at each iteration (until you get to some maximum wait), because if it failed 10 times in the last minute, it's likely going to fail many times in the next minute, so we don't need to try more than once or twice.
Also: in case the client's random number generator is broken, you don't want to accidentally end up with everyone retrying after zero seconds forever.
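One way to get a growing minimum is to draw the wait from between a floor and twice that floor, so a broken random source still yields the floor rather than zero. This is only a sketch; `floored_wait` and its default parameters are made up for illustration.

```python
import random

def floored_wait(attempt, base=1.0, max_wait=60.0, rng=random):
    """Wait between a growing floor and twice that floor, both capped at max_wait."""
    floor = min(max_wait, base * 2 ** (attempt - 1))
    ceiling = min(max_wait, 2 * floor)
    # Even if rng.random() always returns 0.0, the result is still `floor`.
    return floor + rng.random() * (ceiling - floor)
```

With these assumed defaults the floor doubles each attempt (1, 2, 4, 8, ...) until it reaches `max_wait`, after which every client simply waits `max_wait`.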
Exponential backoff will still kill the servers, because the first thing people do is kill and restart the app or reload the webpage, and that just restarts the backoff from the beginning.
So just set 'backoff until [timestamp]' rather than 'backoff for [time interval]'. Generally users restart because they don't know what's going on and assume the client is stuck in a loop or something. Think of a client that can say 'internet is up but my.server is having [local, regional, global] problems. I will try again at 11:37am EST.'
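A minimal sketch of 'backoff until [timestamp]', assuming the client can persist a small state file that survives a restart. The file name `backoff_state.json` and both function names are hypothetical.

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("backoff_state.json")  # hypothetical location for persisted state

def schedule_retry(delay_seconds, now=None, path=STATE_FILE):
    """Store an absolute retry time so a restart cannot reset the backoff."""
    now = time.time() if now is None else now
    retry_at = now + delay_seconds
    path.write_text(json.dumps({"retry_at": retry_at}))
    return retry_at

def seconds_until_retry(now=None, path=STATE_FILE):
    """Return how long the client must still wait; 0.0 if no valid state exists."""
    now = time.time() if now is None else now
    try:
        retry_at = json.loads(path.read_text())["retry_at"]
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return 0.0
    return max(0.0, retry_at - now)
```

On startup the client would call `seconds_until_retry()` before contacting the server, and the stored timestamp is also what it can show the user as the next retry time.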
Downforeveryoneorjustme is exploring an API for service monitoring and it seems to me like every cloud client should have a standardized approach of trying to reach its own server, then checking internet access, then checking a service monitor, then checking social media status updates.
I also think apps and devices should have limited peer-to-peer information sharing instead of only talking to the operating system. In many ways our devices are like a roomful of people whispering status updates to the operating system and/or user but never talking to each other because 'security'.
I make a mobile app and we do this. When the app wakes up it tries to GET a file from /.well-known, and if that works, it proceeds. If it fails, we show a notice saying "will retry at T". The retry delay is random, between 10 and 50 seconds.
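Roughly, that flow could be sketched like this in Python. The probe URL and both function names are placeholders, since the actual file under /.well-known isn't named here.

```python
import random
import time
import urllib.error
import urllib.request

# Placeholder URL; the real file under /.well-known is not specified above.
PROBE_URL = "https://example.invalid/.well-known/app-status"

def server_reachable(url=PROBE_URL, timeout=5.0):
    """Return True if a GET on the probe file succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def next_retry_time(now=None, rng=random):
    """Pick the absolute time T shown in the notice: 10 to 50 seconds from now."""
    now = time.time() if now is None else now
    return now + rng.uniform(10, 50)
```

If `server_reachable()` returns False, the app would compute `next_retry_time()`, display it as "will retry at T", and probe again once that time has passed.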
[0]: https://en.wikipedia.org/wiki/Exponential_backoff#Example_ex...