by blutoot 93 days ago
I'm a little confused. An agent's value-add is to automate what a human actor (in this case, an SRE) does and thus reduce the time to recovery, etc. A human SRE never manually detects an error - we already have well-established anomaly detection implementations, and wiring them to some ticket generation tool is also an established pattern. My confusion is what value the "agent" is bringing here. Nothing wrong with competing with the Datadogs of the world.
4 comments

The problem is a developer spending time setting up alerts for each new feature. I have done it many times on Splunk, and it is very inconvenient. It's also limited to the errors the developer expects, for example a status-code-based alert on a feature. And what happens when an alert fires? A developer has to manually trace logs for a bunch of trace IDs. LogClaw wants to solve this issue. LogClaw monitors your logs 24/7, with no need to set up alerts. When an error arises, it creates a ticket with all the logs for that particular traceId. No time spent on Splunk/Datadog log dashboards. Besides that, most incidents come from unplanned errors in production. For the planned ones, a developer has already set up a graceful way of handling them. What happens if your feature works right, but it is used so frequently that it runs out of memory, or database queries slow down, or an external API is exhausted, etc., and that causes the error? There are many unplanned errors that LogClaw will monitor. LogClaw ingests all the logs, so it knows what's happening throughout your whole codebase.
Logs are pretty dry sometimes.

INFO gives you a ton but it's low SNR.

WARN/ERROR may tell you that something could happen or is happening, but it doesn't tell you what the ramifications may be. It could be nothing!

Now imagine you're getting hundreds, thousands, millions of messages like this an hour. How do you determine what's really important? For instance, if a Kubernetes pod on a single node runs out of space, that could be a problem if your app is only running on that node. But what if your app is spread across 30 nodes?

It's a triage system with context, at least it sounds like it. It's helping you classify based on actual current or potential problems with the app in the ways that a plain log message does not.
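A rough sketch of the kind of context-aware triage described above: the same "disk full" message is classified differently depending on how much of the app's footprint is affected. The function names `blast_radius` and `classify`, the node names, and the 0.5 threshold are all illustrative assumptions, not anything LogClaw documents.

```python
def blast_radius(app_nodes, failing_nodes):
    """Fraction of the app's nodes that are reporting the error."""
    app_nodes = set(app_nodes)
    if not app_nodes:
        return 0.0
    return len(app_nodes & set(failing_nodes)) / len(app_nodes)


def classify(app_nodes, failing_nodes, critical_at=0.5):
    """Map the same log message to a severity based on how much of the app is hit.

    critical_at is an assumed threshold, chosen only for illustration.
    """
    radius = blast_radius(app_nodes, failing_nodes)
    if radius == 0:
        return "ignore"
    return "critical" if radius >= critical_at else "warning"


# Same "disk full" message from node-1, two very different situations.
single = classify(["node-1"], ["node-1"])
spread = classify([f"node-{i}" for i in range(1, 31)], ["node-1"])
print(single, spread)
```

With the app on one node the failure covers the whole footprint; with the app on 30 nodes it covers one thirtieth, so the identical message ranks lower.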

Deciphering ramifications from a log message alone is a pretty unusual way to approach a problem. You still have your 1990s Nagios-style application monitoring, right? So when you wake up to a message that the web monitor says it's not possible to add items to the shopping basket right now, the database monitor signals an unusually long response time, and the application metrics tell you the number of buys is at a fraction of what is normal for this time of day, then that WARN log message from the application telling you that a foreign index constraint is violated is pretty informative!
The quality of your logs is critical. Our algo/LLM knows nothing about your code, only the logs. We currently push toward standardizing on OTel-based logs. You can read about it here https://opentelemetry.io/docs/specs/otel/logs/
LogClaw is capable of ingesting terabytes of logs a day. Our algorithm simply ignores successful request lifecycles, which reduces the strain of analyzing terabytes of logs. It then ranks and flags candidate logs. Later, we retrieve all logs associated with a flagged log and analyze them further, based on metrics, to decide whether it is worthy of a ticket/incident.
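The steps above (skip successful request lifecycles, rank what is left, then pull every log for a flagged request) can be sketched roughly as follows. The actual ranking algorithm is not described in this thread, so the severity-weight scoring here is a stand-in, and the field names `trace_id`, `level`, and `msg`, along with `min_score`, are assumptions for illustration.

```python
from collections import defaultdict

# Assumed weights; a stand-in for whatever ranking the real system uses.
SEVERITY_WEIGHT = {"INFO": 0, "WARN": 1, "ERROR": 3}


def group_by_trace(logs):
    """Collect every log entry under its trace ID."""
    traces = defaultdict(list)
    for entry in logs:
        traces[entry["trace_id"]].append(entry)
    return traces


def score(entries):
    """Sum severity weights across one request lifecycle."""
    return sum(SEVERITY_WEIGHT.get(e["level"], 0) for e in entries)


def flag_traces(logs, min_score=3):
    """Return (score, trace_id, entries) for flagged traces, highest score first."""
    flagged = []
    for trace_id, entries in group_by_trace(logs).items():
        s = score(entries)
        if s == 0:
            continue  # successful lifecycle: nothing but INFO, ignore it
        if s >= min_score:
            flagged.append((s, trace_id, entries))
    flagged.sort(key=lambda item: item[0], reverse=True)
    return flagged


logs = [
    {"trace_id": "t1", "level": "INFO", "msg": "GET /cart 200"},
    {"trace_id": "t2", "level": "INFO", "msg": "POST /checkout"},
    {"trace_id": "t2", "level": "ERROR", "msg": "db timeout"},
    {"trace_id": "t3", "level": "WARN", "msg": "slow query"},
    {"trace_id": "t4", "level": "ERROR", "msg": "out of memory"},
    {"trace_id": "t4", "level": "ERROR", "msg": "worker killed"},
]
flagged = flag_traces(logs)
print([trace_id for _, trace_id, _ in flagged])
```

Each flagged item carries all entries for its trace, including the INFO lines, which is the context a ticket would need.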
>A human SRE never manually detects an error - we already have well-established anomaly detection implementations and wiring them to some ticket generation tool is also an established pattern.

I'm currently dealing with fallout at my job because we were doing all this with humans and no alerts, and we missed a couple of major issues. This product could have prevented a lot of stress in my case, but it'd be a bit like a bandage on a missing limb.

Exactly. Incidents happen with uncaught issues. Something as simple as database query slowness or running out of memory, etc., can cause your "perfectly designed feature" to cause a P1. So it is super convenient to have a system that ingests all of your logs and monitors them for you. No need to manually set up alerts, trace traceIDs, or connect logs throughout microservices.
That still begs the question though: There are existing tools and solutions that do this. Why not use those, and would this being AI make a difference?

"My boss would be more likely to approve it" is a cynical but valid answer.

ALL existing products simply let you set up an alerting system, and that alerting system is configured manually by you. Unexpected issues can still arise. LogClaw is not an alerting system. You just send all your logs; it's capable of ingesting terabytes of logs per day, it automatically ignores all the successful logs, and it works on the uncaught exceptions and errors from all services and the infrastructure itself.
I guess if you don’t want to have to pay for Rapid7 or are too lazy to configure the Teams/Slack integration for your EDR?

But I mean you still have to pay for a Claude API with Moltclaw or whatever no?

It's designed to be SOC 2 compliant with your existing infra. You can spin up a local Ollama [in-cluster LLM] instead of the Claude/OpenAI APIs, or you can use the external Claude/OpenAI APIs instead of local Ollama.
I am confused on the SOC2 compliance part you keep mentioning. How is it SOC2 compliant? You have completed an audit? Is that report or at least an executive summary available? Or it’s all locally hosted and shouldn’t impact my controls?

And the second part about models, if model choice doesn’t matter, what do they do? If LogClaw ingests my logs, applies your custom algorithm to automatically create intelligent alerts without me having to configure anything, what does the LLM do?

If the LLMs are necessary for this, then model choice should matter, no? Some 2 year old version of Mistral or Ollama or NanoGPT isn’t going to perform as well as OpenAI or Claude, no?

I have not done a SOC 2 audit yet. LogClaw is configured to run locally, and you can deploy it in your org, so technically you own all your data. Your logs go through several steps. First comes ranking; only the flagged logs go to the LLM, usually 1-30% of your logs. The LLM is used to understand the root cause and to create a rich-context incident ticket. The LLM is not used to flag your logs. Currently we support standardized OTel logs, so our algo can determine 99% of incidents.
Also, with existing tools the developer configures the alerting conditions. LogClaw automatically finds your incidents without you manually setting up alerting conditions on your log dashboard [Splunk/Datadog logs].
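A minimal sketch of the two-stage flow described above, where flagging happens without the LLM and only flagged logs are handed to a model for the ticket summary. Everything here is hypothetical: `build_tickets`, `has_error`, and `fake_llm` are illustrative names, and `fake_llm` is a stub standing in for a call to a local or hosted model.

```python
def build_tickets(traces, is_flagged, summarize):
    """Call `summarize` only for flagged traces.

    Returns the tickets plus the share of log lines that reached the model.
    """
    tickets = []
    for trace_id, entries in traces.items():
        if not is_flagged(entries):
            continue  # unflagged logs never reach the model
        tickets.append({
            "trace_id": trace_id,
            "summary": summarize(entries),
            "logs": entries,
        })
    sent = sum(len(t["logs"]) for t in tickets)
    total = sum(len(entries) for entries in traces.values())
    share = sent / total if total else 0.0
    return tickets, share


def has_error(entries):
    """Assumed flagging rule: any ERROR line marks the whole trace."""
    return any(e["level"] == "ERROR" for e in entries)


def fake_llm(entries):
    """Stub summarizer; a real deployment would call a model here."""
    return f"{len(entries)} log lines, last: {entries[-1]['msg']}"


traces = {
    "t1": [
        {"level": "INFO", "msg": "GET /cart"},
        {"level": "INFO", "msg": "200 OK"},
    ],
    "t2": [
        {"level": "INFO", "msg": "POST /checkout"},
        {"level": "ERROR", "msg": "db timeout"},
    ],
}
tickets, share = build_tickets(traces, has_error, fake_llm)
print(tickets[0]["summary"], share)
```

Passing the summarizer in as a plain callable keeps the flagging step independent of which model backend is used.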