by saurabhjain1592 162 days ago
Hi HN. When teams move AI agents from demos to production, the failures are rarely about model quality.

They look a lot like classic distributed systems problems.

- Long-running state across multiple steps.

- Partial failures mid-workflow.

- Retries that accidentally repeat side effects.

- Permissions that differ per step, not per agent.

- No clean way to stop, inspect, or intervene once execution starts.
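
To make the retry problem concrete, here is a minimal Python sketch of one common guard: record the result of a side-effecting step under an idempotency key so a retried step reuses it instead of repeating the effect. This is not AxonFlow's API. `IdempotentExecutor`, `with_retries`, and the `workflow-42:send-email` key are hypothetical names for illustration, and the in-memory dict stands in for what would need to be a durable store in production.

```python
from typing import Any, Callable, Dict, List, Tuple


class IdempotentExecutor:
    """Runs a side-effecting action at most once per idempotency key.

    Hypothetical helper: results live in memory here; a real system
    would need a durable store shared across processes.
    """

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        # A retried step finds the recorded result and skips the action.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


def with_retries(fn: Callable[[], Any], attempts: int = 3) -> Any:
    """Naive retry loop: re-runs the whole step on RuntimeError."""
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError as err:
            last_error = err
    raise last_error


def demo() -> Tuple[str, List[str], int]:
    sent: List[str] = []          # stands in for the external side effect
    executor = IdempotentExecutor()
    attempts = {"count": 0}

    def send_email() -> str:
        sent.append("welcome")
        return "msg-1"

    def step() -> str:
        attempts["count"] += 1
        msg_id = executor.run("workflow-42:send-email", send_email)
        # Simulate a partial failure that happens after the side effect.
        if attempts["count"] == 1:
            raise RuntimeError("downstream failure after the email was sent")
        return msg_id

    result = with_retries(step)
    return result, sent, attempts["count"]


if __name__ == "__main__":
    print(demo())
```

In this sketch the step runs twice but the email is sent once, because the second attempt finds the recorded result under the same key. The key has to be derived from the workflow and step, not from the attempt, or the guard does nothing.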

Most agent frameworks are optimized for authoring workflows: prompts, tools, and plans. They are much less optimized for operating those workflows once agents touch real systems, data, or users.

That is why teams often end up adding ad hoc layers after the fact. Logging wrappers. Policy checks. Manual approvals. Retry guards. Kill switches.

We built AxonFlow because these problems do not live at the API boundary. They happen inside execution paths.

AxonFlow is a self-hosted control plane that sits under your LLM or agent stack and governs execution step by step across LLM calls, tool calls, retries, and approvals, without replacing your existing orchestration framework.

It supports:

- Execution-aware policy enforcement, not just ingress checks.

- Human approval gates for high-risk actions.

- Deterministic audit logs, replay, and debugging.

- Cost controls and routing primitives.

- Gateway mode alongside existing LangChain, CrewAI, or custom stacks.
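
As a rough illustration of what per-step enforcement with an approval gate can look like, here is a small Python sketch. It does not show AxonFlow's actual interface. `StepGovernor`, `Step`, `Decision`, the tool names, and the `approver` callback are all hypothetical, and the policy (a per-step tool allowlist plus human approval for high-risk steps) is an assumed example. The point is that each decision is recorded with the reason it was made, not only the outcome.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Set


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Step:
    name: str   # logical step in the workflow
    tool: str   # tool the agent wants to call at this step
    risk: str   # "low" or "high" (assumed two-level scale)


@dataclass(frozen=True)
class Decision:
    step: str
    verdict: Verdict
    reason: str  # why the step was allowed or denied


@dataclass
class StepGovernor:
    """Hypothetical per-step policy check with a human approval gate."""

    allowed_tools: Dict[str, Set[str]]   # step name -> permitted tools
    approver: Callable[[Step], bool]     # stands in for a human reviewer
    audit_log: List[Decision] = field(default_factory=list)

    def check(self, step: Step) -> Decision:
        if step.tool not in self.allowed_tools.get(step.name, set()):
            decision = Decision(
                step.name, Verdict.DENY,
                f"tool '{step.tool}' is not permitted for step '{step.name}'",
            )
        elif step.risk == "high":
            if self.approver(step):
                decision = Decision(step.name, Verdict.ALLOW,
                                    "high risk, approved by reviewer")
            else:
                decision = Decision(step.name, Verdict.DENY,
                                    "high risk, approval refused")
        else:
            decision = Decision(step.name, Verdict.ALLOW,
                                "low risk, tool permitted for this step")
        # Every decision is appended, including denials.
        self.audit_log.append(decision)
        return decision


def demo() -> List[Decision]:
    governor = StepGovernor(
        allowed_tools={"lookup": {"crm.read"}, "refund": {"payments.refund"}},
        approver=lambda step: False,  # reviewer refuses everything here
    )
    steps = [
        Step("lookup", "crm.read", "low"),
        Step("lookup", "payments.refund", "low"),   # wrong tool for step
        Step("refund", "payments.refund", "high"),  # needs approval
    ]
    for step in steps:
        governor.check(step)
    return governor.audit_log


if __name__ == "__main__":
    for d in demo():
        print(d.step, d.verdict.value, "-", d.reason)
```

Note that the same tool (`payments.refund`) is denied at the `lookup` step and gated at the `refund` step, which is the "permissions per step, not per agent" distinction from the list earlier in the post.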

The community core is source-available under BSL 1.1, runs locally, and is fully self-hosted with no signup and no hosted dependency.

Repo: https://github.com/getaxonflow/axonflow

Docs: https://docs.getaxonflow.com

Optional 2-minute demo showing gateway mode, policy enforcement, and a multi-step workflow running locally: https://youtu.be/WwQXHKuZhxc

I would especially value pushback from folks who have dealt with retries, side effects, permission boundaries, or post-incident auditability in real agent workflows.

2 comments

I really liked the idea of a clean way to intervene in the agent flow and a better way to inspect the logging trail. I feel open source is the way to go.
Thanks, appreciate that.

The intervention point ended up being more important than we initially expected. Once workflows become multi-step and stateful, the ability to pause, inspect, or halt execution based on context (not just inputs) becomes the difference between “we noticed later” and “we prevented it.”

We also found that logging alone wasn’t enough. Being able to see why a step was allowed to proceed, not just that it did, made post-incident analysis much less speculative.

Curious if you’ve seen similar issues once agents start touching real systems.

Demo looks sick! Good luck!