by esseph 93 days ago
Logs are pretty dry sometimes.

INFO gives you a ton but it's low SNR.

WARN/ERROR may tell you that something could happen or is happening, but it doesn't tell you what the ramifications of that may be. It could be nothing!

Now imagine you're getting hundreds, thousands, millions of messages like this an hour. How do you determine what's really important? For instance, if a Kubernetes pod on a single node runs out of space, that could be a problem if your app is only running on that node. But what if your app is spread across 30 nodes?

It's a triage system with context, or at least it sounds like it. It helps you classify messages based on actual or potential problems with the app, in ways that a plain log message does not.
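
To make the single-node vs. 30-node point concrete, here's a rough sketch of what "triage with context" could look like. This is only an illustration, not how any particular product does it: the `triage` function, the 50% cutoff, and the "page"/"ticket"/"ignore" labels are all made up for the example.

```python
def triage(affected_nodes: set[str], app_nodes: set[str]) -> str:
    """Classify a node-level fault by how much of the app it touches.

    Hypothetical rule: the same "out of space" message is urgent when it
    hits most of the nodes running the app, and routine when it hits few.
    """
    if not app_nodes:
        raise ValueError("app is not running on any node")

    hit = len(affected_nodes & app_nodes)
    if hit == 0:
        return "ignore"  # the fault is on nodes the app doesn't use

    fraction = hit / len(app_nodes)
    # Assumed threshold: half or more of the app's nodes affected.
    return "page" if fraction >= 0.5 else "ticket"


thirty_nodes = {f"node-{i}" for i in range(1, 31)}
print(triage({"node-1"}, {"node-1"}))    # app lives only on the broken node
print(triage({"node-1"}, thirty_nodes))  # one of 30 nodes is broken
```

Same log line in both calls, different answer, because the deployment topology is the context the message itself doesn't carry.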

2 comments

Deciphering ramifications from a log message alone is a pretty unusual way to approach a problem. You still have your 1990s Nagios-style application monitoring, right? So when you wake up to a message from the web monitor saying it's not possible to add items to the shopping basket right now, the database monitor signals an unusually long response time, and the application metrics tell you the number of buys is at a fraction of what is normal for this time of day, then that WARN log message from the application telling you that a foreign key constraint is violated is pretty informative!
The quality of your logs is critical. Our algo/LLM knows nothing about your code; all it sees is the logs. We are currently pushing toward standardizing on OTel-based logs. You can read about it here: https://opentelemetry.io/docs/specs/otel/logs/
LogClaw is capable of ingesting terabytes of logs a day. Our algorithm simply ignores successful request lifecycles, which helps reduce the strain of analyzing terabytes of logs. It then ranks and flags suspicious logs. Later on, we retrieve all logs associated with a flagged log and analyze them further, based on metrics, to decide whether it is worthy of a ticket/incident.
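
For readers who want a picture of the "drop successful lifecycles, then rank" idea, here is a minimal sketch. It is not LogClaw's actual algorithm: the record shape (`trace_id`, `severity`, `body` keys), the `SEVERITY_WEIGHT` table, and the `rank_suspect_traces` function are assumptions made for illustration, loosely modeled on the trace ID and severity fields in the OTel log data model.

```python
from collections import defaultdict

# Assumed weights; a real system would likely use a richer score.
SEVERITY_WEIGHT = {"INFO": 0, "WARN": 1, "ERROR": 3}


def rank_suspect_traces(records):
    """Group logs by trace, drop clean traces, rank the rest by score.

    Returns a list of (trace_id, score, logs) tuples, highest score first.
    Each tuple carries every log of that trace, so the follow-up analysis
    sees the whole request lifecycle and not just the flagged line.
    """
    by_trace = defaultdict(list)
    for rec in records:
        by_trace[rec["trace_id"]].append(rec)

    ranked = []
    for trace_id, logs in by_trace.items():
        score = sum(SEVERITY_WEIGHT.get(r["severity"], 0) for r in logs)
        if score == 0:
            continue  # treated as a successful lifecycle: ignore it
        ranked.append((trace_id, score, logs))

    # Stable sort: traces with equal scores keep their first-seen order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


records = [
    {"trace_id": "a", "severity": "INFO", "body": "request completed"},
    {"trace_id": "b", "severity": "INFO", "body": "add to basket"},
    {"trace_id": "b", "severity": "ERROR", "body": "constraint violated"},
    {"trace_id": "c", "severity": "WARN", "body": "slow query"},
]
for trace_id, score, logs in rank_suspect_traces(records):
    print(trace_id, score, len(logs))
```

Most of the volume falls away in the `score == 0` branch, which is why skipping clean request lifecycles matters so much at terabyte scale.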