by slipperyp 4222 days ago

I don't know if it's statistically valid, but I've used similar methods to do the kind of calculations you're talking about, in a role similar to the one you described.

There are more subtleties that it might be important for you to take into account - primarily:

1) You need to sample fairly extensively before/after the outage to calibrate more accurately against Holt-Winters (the Holt-Winters seasonal projection should accurately project the trend, but actual numbers are probably running at some slight or significant rate above/below projections)

2) When running those samples, it's important to sample data where you believe the data points are definitely not impacted by the outage. This is often quite challenging since outages sometimes might span low / peak traffic periods or ramp-up/down periods.

3) Finally, it can be hard to pinpoint the actual start / end of the event (to identify the time samples you want to consider in your measurement for the outage cost). Particularly the end, since there's often some pressure from queued operations (by software or by your users who are itching to complete what they were trying to do) that may make your samples fluctuate. That backfill pressure can be substantial and is important not to ignore in your measurement of the actual cost of the issue. Say you're a retail site: you have a 15 minute period of 50% order drop, but for the first 5 minutes after service is restored, the total order rate is 50% above projections. Do you count that as 15 minutes of 50% order drop, or 10 minutes of 50% order drop? Both are legitimate but it's important to know what metric you're measuring yourself against so you're as correct / honest as you can be.
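To make the retail example concrete, here is a rough sketch of the arithmetic. The projected rate of 100 orders per minute and the function name `outage_loss` are made up for illustration; only the 15 minute / 50% drop and 5 minute / 50% surge figures come from the example above.

```python
def outage_loss(projected, actual):
    """Return (gross_loss, backfill, net_loss) in orders.

    projected and actual are per-minute order counts over the same window.
    gross_loss sums every shortfall below projection, backfill sums every
    excess above projection, and net_loss is the difference.
    """
    gross = sum(max(p - a, 0) for p, a in zip(projected, actual))
    backfill = sum(max(a - p, 0) for p, a in zip(projected, actual))
    return gross, backfill, gross - backfill


# Hypothetical projection: a flat 100 orders/minute for 20 minutes.
projected = [100] * 20
# 15 minutes at 50% of projection, then 5 minutes at 150% of projection.
actual = [50] * 15 + [150] * 5

gross, backfill, net = outage_loss(projected, actual)

# Express each figure as "minutes of 50% drop" (50 orders lost per minute).
print(gross / 50)  # gross view: 15 minutes of 50% drop
print(net / 50)    # net view: 10 minutes of 50% drop
```

Under these assumed numbers the gross figure is 750 orders and the net figure is 500 orders, which is where the "15 minutes or 10 minutes" question comes from. Whichever one you report, report it consistently.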


bglazer 4222 days ago

Thanks for the reply!

1. This is a good point. I haven't incorporated sampling after the outage into the analysis, but that should be a good qualitative measure of the accuracy of the forecast.

2. I typically have good data from server logs of when an outage started. Outages during low volume periods are quite difficult to analyze though. I usually revert to just comparing the outage volume to the average volume for the whole outage period.

3. The end is usually more difficult to determine, as there's often a period of instability as servers are restarted sporadically, followed by a "recovery" caused by the backfill pressure that you mentioned. My solution is to count any samples above the forecast's confidence interval as "recovery" and to subtract the total recovery from the loss estimate.
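A minimal sketch of that recovery adjustment, under some assumptions: each sample is a hypothetical `(forecast, upper_bound, actual)` tuple, the outage end is already known as an index, and the recovery for a qualifying sample is measured as the excess over the point forecast (measuring it from the upper bound instead would be a more conservative choice). The function name and numbers are illustrative, not taken from any real analysis.

```python
def estimate_net_loss(samples, outage_end):
    """Return (loss, recovery, net_loss).

    samples: list of (forecast, upper_bound, actual) per interval,
             beginning at the start of the outage.
    outage_end: index of the first interval after service was restored.
    """
    # Loss: forecast minus actual over the outage intervals.
    loss = sum(f - a for f, _, a in samples[:outage_end])
    # Recovery: only post-outage samples above the upper confidence bound
    # count, and each contributes its excess over the point forecast
    # (an assumption; see the note above).
    recovery = sum(a - f for f, u, a in samples[outage_end:] if a > u)
    return loss, recovery, loss - recovery


# Made-up data: forecast 100 with an upper bound of 120 in every interval.
samples = [(100, 120, 50)] * 3 + [
    (100, 120, 150),  # above the bound: counts as recovery
    (100, 120, 130),  # above the bound: counts as recovery
    (100, 120, 110),  # inside the interval: treated as normal noise
]

loss, recovery, net = estimate_net_loss(samples, outage_end=3)
print(loss, recovery, net)
```

Using the confidence bound as the threshold keeps ordinary above-forecast noise from being counted as backfill, at the cost of ignoring small but real recoveries that stay inside the interval.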
