
by diab0lic 4000 days ago

Service owners wanted the system to identify these problems as quickly as possible, so the initial goal was to make the window as small as possible. However, the initial data source rolled data up into 1 minute windows, so we figured 5 minutes was the shortest we could get away with. In practice this seems to work well for service owners, as it avoids calling out small spikes as outliers.

We've got an experimental streaming version going that hasn't been set loose on any services yet, but it can get much finer-grained metrics, around 10s (finer still if we cared to).

edit: Forgot to add that we did try other window sizes, as long as 30 minutes, but we found that longer windows allowed the past to influence the decision being made now too much. With 30 minute windows, if it had spiked in the past we were too aggressive about calling it an outlier. Furthermore, if it had been an inlier and had only just become an outlier, the long window killed our time to detect, which is an important metric for us.
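
To make the time-to-detect point concrete, here is a toy sketch. It is not the actual detector: the trailing mean, the 0.5 threshold, and the function name are all assumptions chosen for illustration. It feeds 1 minute rollups into a trailing window and counts how many minutes pass after something goes bad before the window average crosses the threshold.

```python
def minutes_to_detect(window, threshold=0.5, healthy=0.0, broken=1.0, horizon=120):
    """Toy model (hypothetical, not the real detector): minutes after a step
    change until the trailing-window mean of 1 minute rollups exceeds threshold."""
    # Start with a window full of healthy data points.
    history = [healthy] * window
    for minute in range(1, horizon + 1):
        # One new rollup arrives per minute, now in the broken state.
        history.append(broken)
        recent = history[-window:]
        if sum(recent) / window > threshold:
            return minute
    return None  # never crossed the threshold within the horizon


for w in (5, 30):
    print(f"{w} minute window: detected after {minutes_to_detect(w)} minutes")
```

In this toy model the 5 minute window flags the change after 3 minutes and the 30 minute window after 16, because the older healthy data has to age out of the window before the average moves.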