|
|
|
|
|
by patio11
4061 days ago
|
|
In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed them, and it shows over the last ~30 days that the worse-case performance of our daily backup (~150 MB per day delta, ~45 GB total post deduplication) was about 40% longer than our typical case. This didn't trip our monitoring at the time because they all completed successfully. n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future. |
|
"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."
You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affects network services very convenient.