| Thanks for the thoughtful read. You’ve described exactly the maturity stage we’re targeting: past demos, dealing with retries, partial failures, side effects, and the need for real control once systems are live.

On your questions:

1. Debugging and replay for stateful workflows

We capture step-level execution snapshots across the workflow. Each snapshot records inputs, outputs, duration, tokens, cost, evaluated policies, triggered policies, and the resolution (approved, blocked, overridden). For enforcement-specific debugging, each snapshot includes which policy matched, what content triggered it, and how it was resolved. When a downstream step fails because an upstream step was blocked or modified, you can trace the execution timeline and see exactly where and how the data flow changed.

We also support human-in-the-loop pause and resume. A step can be paused for approval and later resumed, with the decision and rationale recorded as part of the execution history.

This is not yet full deterministic replay (re-running with identical LLM outputs), but it provides enough visibility to answer “what happened” and “why” in production, which covers most real debugging scenarios.

2. Latency overhead at scale

We operate in two modes depending on requirements:

- Compliance mode: policy violations and blocked requests are written synchronously before returning. This adds a few milliseconds for violation cases, but guarantees the audit record exists before the caller sees the result.
- Performance mode: audit writes are queued asynchronously. Policy evaluation still happens inline, since it may block execution, but persistence is decoupled using bounded queues and worker goroutines.

Most policies are rule-based pattern matching rather than LLM calls. In practice, teams see single-digit millisecond overhead per request for typical policy sets. Heavier redaction or more complex policies can increase this, but the behavior remains predictable.
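
To make the difference between synchronous and queued audit writes concrete, here is a rough Go sketch. It is illustrative only and not AxonFlow's actual implementation: `Auditor`, `AuditStore`, `MemStore`, the record fields, and the choice to block the caller when the queue is full are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// AuditRecord is a hypothetical audit entry; field names are illustrative.
type AuditRecord struct {
	StepID   string
	Policy   string
	Decision string // "approved", "blocked", or "overridden"
}

// AuditStore stands in for whatever durable storage backs the audit log.
type AuditStore interface {
	Write(rec AuditRecord) error
}

// MemStore is an in-memory AuditStore used only for this sketch.
type MemStore struct {
	mu      sync.Mutex
	records []AuditRecord
}

func (s *MemStore) Write(rec AuditRecord) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records = append(s.records, rec)
	return nil
}

func (s *MemStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.records)
}

// Auditor writes records either synchronously or through a bounded
// queue drained by worker goroutines.
type Auditor struct {
	store       AuditStore
	synchronous bool
	queue       chan AuditRecord
	wg          sync.WaitGroup
}

func NewAuditor(store AuditStore, synchronous bool, queueSize, workers int) *Auditor {
	a := &Auditor{store: store, synchronous: synchronous}
	if !synchronous {
		a.queue = make(chan AuditRecord, queueSize) // bounded queue
		for i := 0; i < workers; i++ {
			a.wg.Add(1)
			go func() {
				defer a.wg.Done()
				for rec := range a.queue {
					_ = a.store.Write(rec) // a real worker would retry or report errors
				}
			}()
		}
	}
	return a
}

// Record returns after the write completes in synchronous mode. In queued
// mode it only enqueues; if the queue is full it blocks (an assumption here).
func (a *Auditor) Record(rec AuditRecord) error {
	if a.synchronous {
		return a.store.Write(rec)
	}
	a.queue <- rec
	return nil
}

// Close stops accepting records and waits for queued writes to finish.
func (a *Auditor) Close() {
	if a.queue != nil {
		close(a.queue)
		a.wg.Wait()
	}
}

func main() {
	store := &MemStore{}

	compliance := NewAuditor(store, true, 0, 0)
	_ = compliance.Record(AuditRecord{StepID: "step-1", Policy: "pii-redaction", Decision: "blocked"})
	fmt.Println("records after synchronous write:", store.Len())

	performance := NewAuditor(store, false, 128, 2)
	for i := 0; i < 10; i++ {
		_ = performance.Record(AuditRecord{StepID: fmt.Sprintf("step-%d", i), Policy: "pii-redaction", Decision: "approved"})
	}
	performance.Close()
	fmt.Println("records after queue drain:", store.Len())
}
```

The trade-off is visible in `Record`: the synchronous path pays the storage latency on every call but never returns before the record exists, while the queued path returns immediately and relies on `Close` (or a similar drain step) to flush pending writes.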
Observe-only mode adds essentially no latency beyond the audit write, since no blocking decisions are made.

On orchestration boundaries: AxonFlow does not require replacing your existing orchestrator. Most teams keep LangChain, LangGraph, or CrewAI for stateful workflow execution and use AxonFlow as a step-level control plane, adding policy gates before each step runs.

For teams building from scratch or wanting tighter integration, AxonFlow can also handle orchestration end to end with governance built in. In practice, most start by adding governance to existing workflows and only consider deeper orchestration later.

For related discussion on how we think about the observability-to-enforcement gap, there’s a deeper thread here that may be relevant:
https://news.ycombinator.com/item?id=46603800

Happy to go deeper on any of this if useful. |