by ElevenLathe 236 days ago
Do you have any advice about EDA use cases? My management is clearly interested in it -- presumably because it sounds like robots doing work that expensive professionals currently do -- but so far I haven't found a recurring alert that could benefit from this, since it's always much easier to fix whatever bug in the application is causing those states to occur.

A lot of what we used it for was to compensate for non-responsive dev teams or vendor apps that we couldn't fix. EDA felt like a tool for the powerless. We can't fix the real problem, but we can automate the band-aid. I was on an ops team dealing with hundreds of apps and teams. They didn't care if ops had to restart their app constantly; it wasn't their problem.

We had a lot of alerts where restarting a service would fix it, so we had EDA do that. That effectively freed up 3 resources to do other things just for a single application we monitored.
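
As a rough illustration of the restart case, here is a minimal Python sketch. Everything in it is an assumption for the example, not what we actually ran: the alert is a dict with a `service` key, the handler runs on the affected host, the service is a systemd unit, and the cap of 3 restarts per hour is arbitrary. The cap is there so a flapping service gets handed to a human instead of being restarted forever.

```python
import subprocess
import time
from collections import defaultdict, deque

MAX_RESTARTS = 3        # hypothetical cap per window
WINDOW_SECONDS = 3600   # hypothetical window length

# Timestamps of recent automated restarts, per service.
_history = defaultdict(deque)


def handle_alert(alert, run=subprocess.run, now=time.time):
    """Restart the service named in the alert, or escalate if it keeps flapping.

    `alert` is a hypothetical dict such as {"service": "myapp.service"}.
    `run` and `now` are injectable so the logic can be tested without systemd.
    """
    service = alert["service"]
    t = now()
    recent = _history[service]

    # Forget restarts that fell out of the window.
    while recent and t - recent[0] > WINDOW_SECONDS:
        recent.popleft()

    if len(recent) >= MAX_RESTARTS:
        # Too many automated restarts: leave the ticket for a human.
        return "escalate"

    run(["systemctl", "restart", service], check=True)
    recent.append(t)
    return "restarted"
```

Returning `"escalate"` rather than restarting quietly is a deliberate choice in this sketch: it keeps a chronic problem visible.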

We have some EDA for disk cleanup, to delete files in some "safe" directories common to the OS, not applications. More often than not, the disk space issues are due to the application team and they really need to clean things up or make their own cleanup job. If you're the application owner you can be much more targeted, where I had to write something that would work for hundreds of different app teams. But of course if you own the app you can fix the excessive logging issues at the source, which is even better. Some vendor apps left a lot of junk out there. We'd clean up BladeLogic temp files (back when we used that) and of course temp directories that people never bothered to clean up.
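
A generic cleanup along these lines could look like the Python sketch below. The directory allow-list and the 7-day age threshold are made up for illustration, and it defaults to a dry run so nothing is deleted until you have looked at what it would remove.

```python
import time
from pathlib import Path

# Hypothetical allow-list of OS-level directories considered safe to prune.
SAFE_DIRS = [Path("/tmp"), Path("/var/tmp")]
MAX_AGE_DAYS = 7  # hypothetical age threshold


def clean_old_files(directories, max_age_days, now=None, dry_run=True):
    """Return regular files older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for directory in directories:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        # Materialize the listing first so deleting does not disturb the walk.
        for path in list(directory.rglob("*")):
            try:
                # Skip symlinks and anything that is not a regular file.
                if path.is_symlink() or not path.is_file():
                    continue
                if path.stat().st_mtime < cutoff:
                    if not dry_run:
                        path.unlink()
                    removed.append(path)
            except OSError:
                # File vanished or is not accessible; leave it alone.
                continue
    return removed
```

Usage would be something like `clean_old_files(SAFE_DIRS, MAX_AGE_DAYS, dry_run=False)` once the dry-run output looks sane.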

Another thing we used it for was to enrich the data in the alert ticket. If the first thing you do when getting a certain alert is to check various logs or run certain commands to get more information, have the EDA do that and put that data right in the ticket so you already have it. One simple example we had was for ping alerts. In many cases a ping alert would clear on its own, so we added some EDA to check the uptime on the server and put that information into the ticket. This way the ops person could quickly see if the server rebooted. If that reboot was unexpected, the app team should be made aware and verify their app. Without that, a cleared alert would be assumed to be some network latency and dismissed as "noise".
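
The uptime check can be sketched in a few lines of Python. This assumes a Linux host, where the first field of `/proc/uptime` is seconds since boot; the function names, the 120-second margin, and the wording of the note are all hypothetical. How the note gets attached to the ticket depends on your ticketing system, so that part is left out.

```python
from datetime import timedelta


def parse_proc_uptime(text):
    """Return seconds since boot from the contents of /proc/uptime (Linux)."""
    return float(text.split()[0])


def uptime_note(uptime_s, alert_age_s, margin_s=120.0):
    """Build a ticket note saying whether the host rebooted around the alert.

    uptime_s:    seconds since the host booted
    alert_age_s: seconds since the ping alert fired
    margin_s:    hypothetical slack for detection delay
    """
    # If the host has been up for less time than the alert has existed
    # (plus some slack), it must have booted around the time of the alert.
    rebooted = uptime_s < alert_age_s + margin_s
    note = f"Uptime: {timedelta(seconds=int(uptime_s))}. "
    if rebooted:
        note += "Host appears to have rebooted around this alert; app team should verify."
    else:
        note += "No reboot detected around this alert."
    return {"rebooted": rebooted, "note": note}
```

On the host, the input would come from something like `parse_proc_uptime(open("/proc/uptime").read())`.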

Depending on how quickly an EDA band-aid can roll out vs the fix, EDA can also buy you time while you implement the real fix, so you're not bogged down with operational work. This is especially true if the real fix for a problem will require massive changes that could take months to actually implement.

For a while we had a lot of issues with BTRFS filesystems, and we ended up making some EDA to run the btrfs balance when we started getting in the danger zone, to avoid the server locking up. This was a way to keep things under control and take the pressure off of the ops team while we migrated away from btrfs, which was a multi-year undertaking.
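
For a sense of what that kind of automation can look like, here is a hedged Python sketch. The "danger zone" test (less than 10% of the device unallocated) and the `-dusage=50` filter are my assumptions for the example, not the thresholds we used, and the parsing assumes `btrfs filesystem usage -b` prints `Device size:` and `Device unallocated:` lines with raw byte counts, which you should confirm on your own btrfs-progs version.

```python
import re
import subprocess

MIN_UNALLOCATED_FRACTION = 0.10  # hypothetical "danger zone" threshold


def parse_usage(text):
    """Extract (device_size, unallocated) in bytes from `btrfs filesystem usage -b`.

    Assumes the output contains 'Device size:' and 'Device unallocated:' lines
    followed by raw byte counts.
    """
    size = int(re.search(r"Device size:\s+(\d+)", text).group(1))
    unallocated = int(re.search(r"Device unallocated:\s+(\d+)", text).group(1))
    return size, unallocated


def needs_balance(size, unallocated, min_fraction=MIN_UNALLOCATED_FRACTION):
    """True when the unallocated share of the device drops below the threshold."""
    return size > 0 and unallocated / size < min_fraction


def balance_command(mount, usage_pct=50):
    """Balance only data chunks that are at most usage_pct percent full."""
    return ["btrfs", "balance", "start", f"-dusage={usage_pct}", mount]


def maybe_balance(mount, run=subprocess.run):
    """Check the filesystem and start a balance if it is in the danger zone."""
    result = run(
        ["btrfs", "filesystem", "usage", "-b", mount],
        capture_output=True, text=True, check=True,
    )
    size, unallocated = parse_usage(result.stdout)
    if needs_balance(size, unallocated):
        run(balance_command(mount), check=True)
        return True
    return False
```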

Reporting on your alert tickets should highlight any opportunities you might have, if they exist. If you have an ops team, ask them too, they'll know. But of course, if they can be fixed in the code, or the monitor just needs to be tuned, that's even better. EDA should be a last resort (other than the ticket enrichment use case).

The danger of EDA is that it can push problems to the back burner. If there is a chronic issue and EDA is taking care of it, the root cause is never resolved, because it's no longer the squeaky wheel. If you go down the EDA route, it is a good idea to report on how often it runs, and review it regularly to drive improvements in the app. High numbers of EDA-resolved alerts shouldn't be the goal. Ideally, those EDA metrics should also be driven down as the apps improve, just like you'd want to see from alerts being handled by humans. At the end of the day, they are still undesirable events that reduce the stability of your environment.
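
The reporting side can be very simple. The Python sketch below assumes you can export EDA runs as records with an `app` name and a `when` date (both field names are hypothetical). It counts runs per app per ISO week and flags apps whose count went up, which is the opposite of the trend you want.

```python
from collections import Counter


def eda_runs_per_week(events):
    """Count EDA runs per (app, ISO year, ISO week).

    `events` is an iterable of hypothetical records like
    {"app": "billing", "when": datetime.date(2024, 1, 8)}.
    """
    counts = Counter()
    for event in events:
        iso = event["when"].isocalendar()
        counts[(event["app"], iso[0], iso[1])] += 1
    return counts


def rising_apps(counts, prev_week, this_week):
    """Return apps with more EDA runs this week than last week.

    prev_week and this_week are (ISO year, ISO week) tuples.
    """
    apps = {app for (app, _year, _week) in counts}
    return sorted(
        app for app in apps
        if counts[(app, *this_week)] > counts[(app, *prev_week)]
    )
```

Anything `rising_apps` returns is a candidate for a conversation with the app team rather than more automation.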

Love this... meaning simply reusing a quick fix is definitely not ideal for identifying root causes. LLMs have come a long way, and I feel that with adequate tooling and context (the rich ticket data you mentioned), they could really be a great solution, or at least provide even better context to developers.
This makes sense. The whole idea is catnip for empire-building ops managers: makes them look proactive while also building in a dependency on a new system that only ops knows anything about.