by al_borland 242 days ago
I spent the first half of my career in ops, watching those alerts, escalating things, fixing stuff, writing EDA to fix stuff, working with monitoring teams and dev teams to tune monitoring, etc. Over time I worked my way into a dev role, but I am still focused on the infrastructure.

The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.

What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
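
As a rough illustration of the consecutive-failure idea, here is a minimal Python sketch. The class and method names are made up for the example; most monitoring tools expose this as a setting rather than code, so treat it as a picture of the logic and not as any particular product's API.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureMonitor:
    """Raise an alert only after `threshold` failed checks in a row."""

    threshold: int = 3
    failures: int = 0  # current run of consecutive failed checks

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True if an alert should be open."""
        if check_ok:
            self.failures = 0  # any healthy check resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold


monitor = ConsecutiveFailureMonitor(threshold=3)
# One check every 5 minutes: a short spike, a recovery, then a prolonged problem.
results = [False, False, True, False, False, False, False]
alerts = [monitor.record(ok) for ok in results]
```

The short spike (two failed checks) never alerts, while the prolonged problem alerts on its third failed check and stays alerting until a healthy check resets the count.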

If there are other alerts that happen often that need action taken, which is repeatable, that’s where EDA (Event Driven Automation) would come in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle it when the EDA code can’t fix the issue. Fix it once in code instead of every time you get an alert.


Do you have any advice about EDA use cases? My management is clearly interested in it -- presumably because it sounds like robots doing work that expensive professionals currently do -- but so far I haven't found a recurring alert that would benefit from it, since it's always much easier to fix whatever bug in the application is causing those states to occur.
A lot of what we used it for was to compensate for non-responsive dev teams or vendor apps that we couldn't fix. EDA felt like a tool for the powerless. We can't fix the real problem, but we can automate the band-aid. I was on an ops team dealing with hundreds of apps and teams. They didn't care if ops had to restart their app constantly; it wasn't their problem.

We had a lot of alerts where restarting a service would fix it, so we had EDA do that. That effectively freed up 3 resources to do other things just for a single application we monitored.

We have some EDA for disk cleanup, to delete files in some "safe" directories common to the OS, not applications. More often than not, the disk space issues are due to the application team and they really need to clean things up or make their own cleanup job. If you're the application owner you can be much more targeted, where I had to write something that would work for hundreds of different app teams. But of course if you own the app you can fix the excessive logging issues at the source, which is even better. Some vendor apps left a lot of junk out there. We'd clean up BladeLogic temp files (back when we used that) and of course temp directories that people never bothered to clean up.
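
To make the disk cleanup case concrete, here is a minimal Python sketch. It is not our actual tool: the function name, the 7-day cutoff, and the dry-run flag are all assumptions for illustration, and the demo runs against a throwaway temp directory rather than a real system path.

```python
import os
import tempfile
import time
from pathlib import Path


def clean_old_files(directory: Path, max_age_days: float, dry_run: bool = True) -> list[Path]:
    """Delete regular files directly under `directory` older than `max_age_days`.

    Returns the files that were (or, in dry-run mode, would be) removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in directory.iterdir():
        # Skip subdirectories and symlinks so the job never leaves the safe directory.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return sorted(removed)


# Demo against a throwaway directory rather than a real system path.
with tempfile.TemporaryDirectory() as tmp:
    safe_dir = Path(tmp)
    old_file = safe_dir / "old.tmp"
    new_file = safe_dir / "new.tmp"
    old_file.write_text("stale")
    new_file.write_text("fresh")
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_file, (ten_days_ago, ten_days_ago))  # backdate the mtime

    removed_names = [p.name for p in clean_old_files(safe_dir, max_age_days=7, dry_run=False)]
    remaining_names = sorted(p.name for p in safe_dir.iterdir())
```

Defaulting to a dry run is a deliberate choice here: when one script has to work for hundreds of app teams, being able to log what would be deleted before deleting anything is cheap insurance.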

Another thing we've used it for was to enrich the data in the alert ticket. If the first thing you do when getting a certain alert is to check various logs or run certain commands to get more information, have the EDA do that and put that data right in the ticket so you already have it. One simple example we had was for ping alerts. In many cases a ping alert would clear on its own, so we added some EDA to check the uptime on the server and put that information into the ticket. This way the ops person could quickly see if the server rebooted. If that reboot was unexpected the app team should be made aware and verify their app. Without that, a clear alert would be assumed to be some network latency and dismissed as "noise".
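
The uptime check for ping alerts could be sketched in Python roughly like this. The function names and the wording of the ticket note are hypothetical, and it assumes the uptime is collected on a Linux host by reading /proc/uptime; how that value gets from the server into the ticket depends entirely on your tooling.

```python
from pathlib import Path


def read_uptime_seconds(proc_uptime: str = "/proc/uptime") -> float:
    """Return seconds since boot (Linux only: first field of /proc/uptime)."""
    return float(Path(proc_uptime).read_text().split()[0])


def uptime_note(uptime_seconds: float, alert_age_seconds: float) -> str:
    """Build the line appended to the ticket for a ping alert."""
    hours, remainder = divmod(int(uptime_seconds), 3600)
    minutes = remainder // 60
    note = f"Server uptime: {hours}h {minutes}m."
    # If the host has been up for less time than the alert has existed,
    # it booted after the alert fired, so the outage was probably a reboot.
    if uptime_seconds < alert_age_seconds:
        note += " Host rebooted since the alert fired; app team should verify their app."
    else:
        note += " No reboot since the alert fired."
    return note


# Alert opened 10 minutes ago; one host came back 4 minutes ago, one has been up 10 days.
rebooted_note = uptime_note(uptime_seconds=240, alert_age_seconds=600)
stable_note = uptime_note(uptime_seconds=864000, alert_age_seconds=600)
```

On the host itself you would pass `read_uptime_seconds()` as the first argument; the demo uses fixed numbers so it runs anywhere.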

Depending on how quickly an EDA band-aid can roll out vs the fix, EDA can also buy you time while you implement the real fix, so you're not bogged down with operational work. This is especially true if the real fix for a problem will require massive changes that could take months to actually implement.

For a while we had a lot of issues with BTRFS filesystems, and we ended up making some EDA to run the btrfs balance when we started getting in the danger zone, to avoid the server locking up. This was a way to keep things under control and take the pressure off of the ops team while we migrated away from btrfs, which was a multi-year undertaking.

Reporting on your alert tickets should highlight any opportunities you might have, if they exist. If you have an ops team, ask them too, they'll know. But of course, if they can be fixed in the code, or the monitor just needs to be tuned, that's even better. EDA should be a last resort (other than the ticket enrichment use case).

The danger of EDA is that it can push problems to the back burner. If there is a chronic issue and EDA is taking care of it, the root cause is never resolved, because it's no longer the squeaky wheel. If you go down the EDA route, it is a good idea to report on how often it runs, and review it regularly to drive improvements in the app. High numbers of EDA resolved alerts shouldn't be the goal. Ideally, those EDA metrics should also be driven down as the apps improve, just like you'd want to see from alerts being handled by humans. At the end of the day, they are still undesirable events that reduce the stability of your environment.
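
Reporting on how often EDA runs does not need much. As a minimal sketch in Python, assuming you can export each run as a date plus the name of the automation that fired (the names and sample data below are invented for illustration), a weekly count per automation is enough to see whether the numbers are trending down:

```python
from collections import Counter
from datetime import date


def weekly_run_counts(runs):
    """Count EDA runs per (ISO year, ISO week, automation name)."""
    counts = Counter()
    for run_date, automation in runs:
        iso = run_date.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1], automation)] += 1
    return counts


# Invented sample data: (date the automation ran, which automation it was).
runs = [
    (date(2024, 1, 1), "restart_service"),
    (date(2024, 1, 3), "restart_service"),
    (date(2024, 1, 8), "restart_service"),
    (date(2024, 1, 9), "disk_cleanup"),
]
counts = weekly_run_counts(runs)
```

A count that stays flat or climbs week over week for the same automation is the signal that a chronic issue is being hidden instead of fixed.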

Love this... meaning that simply reusing a quick fix is definitely not ideal for identifying root causes. LLMs have come a long way, and I feel that with adequate tooling and context (the rich ticket data you mentioned), they could really be a great solution, or at least provide even better context to developers.
This makes sense. The whole idea is catnip for empire-building ops managers: makes them look proactive while also building in a dependency on a new system that only ops knows anything about.
What are the tools used to implement EDA? Not sure how I would implement the automation part without writing code, which I'm trying to avoid if there are mature tools available.
We have a homegrown tool. It looks at all the tickets coming in, checks for a regex match with the defined patterns, and if one matches it runs the associated script to try and resolve it. Depending on success or failure it either closes the ticket with notes or adds the notes and escalates it. That’s what I understand of it at a high level; I didn’t write it, I just used it and requested some features and changes. In the first iteration it was calling automation to remediate from a low-code/no-code orchestration tool, and these days it’s calling Ansible, but any API would work.
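
For a sense of how small the core of such a tool can be, here is a minimal Python sketch of the match-then-remediate loop as I understand it. Everything in it is hypothetical (the `Ticket` class, the patterns, the simulated remediation functions); the real tool calls out to Ansible rather than local functions, and I have not seen its code.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Ticket:
    summary: str
    status: str = "open"
    notes: list[str] = field(default_factory=list)


def handle_ticket(ticket, rules):
    """Run the first remediation whose pattern matches, then close or escalate."""
    for pattern, remediate in rules:
        if pattern.search(ticket.summary):
            try:
                ok, notes = remediate(ticket)
            except Exception as exc:  # a crashing script counts as a failed fix
                ok, notes = False, f"remediation raised {exc!r}"
            ticket.notes.append(notes)
            ticket.status = "closed" if ok else "escalated"
            return ticket
    # No rule matched: leave the ticket for a human to pick up.
    return ticket


# Simulated remediations; real ones would call Ansible or some other API.
def restart_service(ticket):
    return True, "restarted service (simulated)"


def failed_cleanup(ticket):
    return False, "cleanup freed no space (simulated)"


rules = [
    (re.compile(r"service .* not responding"), restart_service),
    (re.compile(r"disk usage above \d+%"), failed_cleanup),
]

closed = handle_ticket(Ticket("service billing-api not responding"), rules)
escalated = handle_ticket(Ticket("disk usage above 95% on /var"), rules)
untouched = handle_ticket(Ticket("certificate expires in 7 days"), rules)
```

Either way the notes end up on the ticket, so when a fix fails the person who gets the escalation can see what was already tried.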

There are vendor solutions out there. Ansible now offers EDA as part of Ansible Automation Platform, though I haven’t been hands-on with it yet. That still requires writing Ansible playbooks, not to mention the overhead of AAP.

I don’t remember the name, but I sat in on a demo of an AI powered EDA platform probably 6 years ago (before the LLM craze). Their promise was that it would automatically figure out what to do and do it, and over time it would handle more and more incidents. It sounded a little terrifying. I could see it turning into a chaos monkey, but who knows.

Either way, there are some mature tools out there. What would work best depends on what you need to integrate with, cost, support, and how much code you are or aren’t willing to write.