by bglazer 4181 days ago
Thanks for the reply!

1. This is a good point. I haven't incorporated sampling after the outage into the analysis, but that should be a good qualitative measure of the accuracy of the forecast.

2. I typically have good data from server logs on when an outage started. Outages during low-volume periods are quite difficult to analyze, though. I usually fall back to just comparing the outage volume to the average volume for the whole outage period.

3. The end is usually harder to determine, since there's typically a period of instability as servers are restarted sporadically, followed by a "recovery" caused by the backfill pressure that you mentioned. My solution is to count any samples above the forecast's confidence interval as "recovery" and to subtract the total recovery from the loss estimate.
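
A rough sketch of that bookkeeping in Python, for illustration only. The function name, the argument names, and the choice to measure recovery as the excess over the point forecast (rather than over the upper bound) are all assumptions made for this example, not a description of the actual analysis code.

```python
def estimate_net_loss(actual, forecast, upper, outage_end):
    """Return (net_loss, gross_loss, recovery) for one outage.

    All names here are hypothetical. `actual`, `forecast`, and `upper`
    are equal-length per-interval volumes starting at the outage start;
    `upper` is the upper edge of the forecast's confidence interval and
    `outage_end` is the index of the first post-outage sample.
    """
    # Shortfall while the outage is in progress: forecast minus observed.
    gross_loss = sum(
        f - a for a, f in zip(actual[:outage_end], forecast[:outage_end])
    )

    # After the outage, only samples above the upper confidence bound
    # count as recovery. Assumption: the recovered amount is the excess
    # over the point forecast.
    recovery = sum(
        a - f
        for a, f, u in zip(
            actual[outage_end:], forecast[outage_end:], upper[outage_end:]
        )
        if a > u
    )

    return gross_loss - recovery, gross_loss, recovery


# Made-up numbers: two outage intervals, then two post-outage intervals.
actual = [20, 10, 130, 105]
forecast = [100, 100, 100, 100]
upper = [110, 110, 110, 110]
net, gross, recovered = estimate_net_loss(actual, forecast, upper, outage_end=2)
print(net, gross, recovered)
```

With these numbers only the 130 sample clears the upper bound, so the 105 sample is treated as normal variation and does not reduce the loss estimate.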