by sirtoffski 2303 days ago
I’ll chime in here to say we use both at work. In a NOC at a medium-sized ISP, we are getting hammered with alerts 24/7. Some are not urgent, while others need to be actioned much faster - I mean, a 100G transit link going down is no good.

We’d receive an automatic email about a large circuit going down, and we’d also receive a ticket about it; sometimes people don’t look at the tickets closely enough, other times people get distracted with other topics, issues, etc. Having a large screen with interface status monitoring has proven to be effective enough; for example, someone walks by the monitor and says “why is this thing red, is it supposed to be?”... and we immediately know one of the larger interfaces is down.

In an ideal world, we would not need it because every ticket would be diligently dealt with... however, in the real world, having a big red part of the screen flashing has proved quite effective.


If you're getting alerts for non-actionable events, you need to do a better job of tuning your monitors and alerts.

Alerts shouldn't be sent about anything that doesn't require an action.

Well, the thing is, the alerts are indeed for actionable events.

For example, many remote locations have an on-site battery backup, which supplies power in the event of losing commercial power. Those alerts are actioned by notifying field teams and deciding whether a specific location needs to be placed on a generator.

Imagine a hurricane has disrupted the commercial power grid and there are thousands of “site on battery” alerts; somewhere among them there is also an alert for OSPF down between two core switches.

Having a monitor with a large red warning saying “Link X at location Y is down!” is a pretty effective way to not miss important notifications.

I mean, playing devil’s advocate, one might say “Then your alerts should have a better filtering system, with the important ones staying at the top of the page”... which is true. A lot of smart design features can render dashboards less relevant - however, when there aren’t enough resources in a DevOps team to implement those solutions, a simple dashboard can go a long way!
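To make the “important ones at the top” idea concrete, here is a minimal sketch in Python. The `Alert` fields, the severity names, and the `prioritize` and `banner` helpers are all made up for illustration; they are not from any real monitoring tool, and a real setup would need its own severity rules.

```python
from dataclasses import dataclass

# Hypothetical severity ranks (an assumption); lower numbers sort first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Alert:
    kind: str        # e.g. "OSPF down", "site on battery"
    location: str
    severity: str    # one of the keys in SEVERITY_RANK


def prioritize(alerts):
    """Return alerts with the most severe first.

    sorted() is stable, so arrival order is kept within each severity.
    Unknown severities sort after all known ones.
    """
    return sorted(
        alerts,
        key=lambda a: SEVERITY_RANK.get(a.severity, len(SEVERITY_RANK)),
    )


def banner(alerts):
    """Return the text for a big red banner, or None if nothing is critical."""
    critical = [a for a in alerts if a.severity == "critical"]
    if not critical:
        return None
    return "; ".join(f"{a.kind} at {a.location}!" for a in critical)
```

With a thousand “site on battery” alerts and one OSPF alert buried in the middle, `prioritize` puts the OSPF alert first, and `banner` gives the one line worth showing in red on the wall monitor.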

There’s a wide variety of “requires action”. It might be fine to act within 1 hour, or it might need action within 10 min. Both deserve an alert, but only one requires you to immediately stop your coffee break...

In an ideal world, I agree. But sometimes an automated system cannot perfectly decide on the severity of an alert, which leads to some alerts being ignored, which is fine.

If an alert doesn't need action for 1 hour, the alert should happen 1 hour later.
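One way to read that rule, as a rough sketch: give each condition a grace period and only notify once the condition has been active that long without clearing. The `Condition` class, its fields, and `should_notify` are hypothetical names for illustration, not the API of any real alerting system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Condition:
    name: str
    started_at: datetime
    grace: timedelta                        # how long it can wait before someone must act
    cleared_at: Optional[datetime] = None   # set when the condition resolves on its own


def should_notify(cond: Condition, now: datetime) -> bool:
    """Notify only if the condition is still active once its grace period is up."""
    if cond.cleared_at is not None and cond.cleared_at <= now:
        return False  # already resolved, nobody needs to be interrupted
    return now - cond.started_at >= cond.grace
```

A transit link going down would get a grace period of zero and notify immediately, while a “site on battery” condition with a one-hour grace period stays quiet if commercial power comes back within the hour.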

Alert fatigue is a big problem in the sysadmin/devops world.

Yeah, if anyone from Google is here: it would be nice if you allowed us to fine-tune the alerts from GSC - I came in today and found 87 non-useful alerts in my inbox.

I will have a look at the tool and have a play - I assume you can have multiple pages :-)

Would be cool to monitor GA for the "1 2 3 4 5 …. Many" sites I have an interest in