I was building agentic workflows for my CRM — Otter.ai recordings → Clay enrichment → CRM updates — and got tired of LLM-generated pipelines silently doing the wrong thing. A pipeline that "worked" was pushing contacts without validating email format, making API calls I didn't authorize, and failing silently when field names didn't match between steps.
The problem isn't that LLMs write bad code. It's that there's no contract between what you asked for and what runs. Structured outputs solve format. Guardrails AI solves content safety. Temporal solves execution. Nobody checks whether the workflow itself makes sense as a pipeline.
So I built a verification layer. The LLM outputs a workflow AST via structured outputs. Before anything executes, the engine type-checks data flow across steps, validates schemas at boundaries, and requires every side effect (API calls, DB writes, webhooks) to be explicitly declared. You get a manifest — "this workflow READs from Salesforce and WRITEs to HubSpot" — that a compliance system can review without reading code.
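To make the data-flow check concrete, here is a minimal sketch of the idea, not the project's actual code. It uses plain dataclasses rather than Pydantic so it runs with no dependencies, and the `Step` class, the `check_data_flow` function, and the step and field names are all hypothetical, chosen to mirror the Otter.ai → Clay → CRM example. It assumes a linear pipeline where each step declares the fields it consumes and produces.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical AST node: one pipeline step with typed inputs and outputs."""
    name: str
    inputs: dict[str, type]    # field name -> type this step consumes
    outputs: dict[str, type]   # field name -> type this step produces


def check_data_flow(steps: list[Step], initial: dict[str, type]) -> list[str]:
    """Walk the steps in order and report missing or mistyped input fields.

    Nothing is executed; only the declared schemas are compared.
    """
    available = dict(initial)
    errors = []
    for step in steps:
        for fname, ftype in step.inputs.items():
            if fname not in available:
                errors.append(f"{step.name}: missing input '{fname}'")
            elif available[fname] is not ftype:
                errors.append(
                    f"{step.name}: '{fname}' is {available[fname].__name__}, "
                    f"expected {ftype.__name__}"
                )
        # Fields produced by this step become available to later steps.
        available.update(step.outputs)
    return errors


# The last step expects 'email_address', but the enrich step produces 'email'.
steps = [
    Step("transcribe", {"recording_url": str}, {"transcript": str}),
    Step("enrich", {"transcript": str}, {"email": str, "company": str}),
    Step("update_crm", {"email_address": str, "company": str}, {}),
]
errors = check_data_flow(steps, {"recording_url": str})
print(errors)
```

Here the check flags the field-name mismatch between `enrich` and `update_crm` before any step runs, which is the kind of silent failure described above.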
~800 lines of Python, zero deps beyond Pydantic, MIT licensed. Would especially love feedback from folks building agentic systems in production — the schema library for domain-specific patterns is the most obvious area for contributions.
This solves the pre-execution side well. The complementary problem is post-delivery verification. Pre-flight checks answer "will this pipeline do what I asked?" Post-delivery verification answers "did this agent actually deliver what it promised?" Different trust boundary, same core insight: you need a contract between intent and execution. Does the manifest model extend to that kind of trust boundary?
I love this approach to verification. I literally just launched my own AI formatting engine yesterday, and the hardest part wasn't the generation—it was building strict system-level guardrails to stop the model from outputting generic fluff words and breaking my slide formatting. Are you doing this pre-execution verification purely through secondary prompt checks, or are you running it through a separate smaller model first?
Thanks! We're doing pre-execution verification through static analysis of the workflow AST — no secondary model involved. The verifier runs deterministically against declared effects and type constraints, so it catches issues before anything executes. Curious about your approach — are your guardrails rule-based or are you using a classifier?
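A rough sketch of what checking against declared effects could look like, under stated assumptions: `Effect`, `EffectfulStep`, and `verify_effects` are hypothetical names, the real engine may model effects differently, and the Slack step is invented purely to trigger a violation. The check is plain set membership over the AST, so the same input always yields the same result and no model is consulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    """Hypothetical side-effect record; frozen so it is hashable."""
    kind: str      # "READ" or "WRITE"
    target: str    # external system, e.g. "Salesforce"


@dataclass
class EffectfulStep:
    """Hypothetical AST node carrying the effects the step would perform."""
    name: str
    effects: tuple[Effect, ...] = ()


def verify_effects(steps, declared):
    """Return (manifest, violations) without executing any step."""
    declared = set(declared)
    violations = [
        f"{step.name}: undeclared {effect.kind} on {effect.target}"
        for step in steps
        for effect in step.effects
        if effect not in declared
    ]
    # The manifest lists every effect the workflow would perform, sorted
    # so that a reviewer sees a stable, diffable summary.
    manifest = sorted(
        {f"{effect.kind} {effect.target}" for step in steps for effect in step.effects}
    )
    return manifest, violations


declared = {Effect("READ", "Salesforce"), Effect("WRITE", "HubSpot")}
steps = [
    EffectfulStep("fetch_contacts", (Effect("READ", "Salesforce"),)),
    EffectfulStep("push_contacts", (Effect("WRITE", "HubSpot"),)),
    EffectfulStep("notify", (Effect("WRITE", "Slack"),)),  # never declared
]
manifest, violations = verify_effects(steps, declared)
print(manifest)
print(violations)
```

In this sketch the undeclared Slack write is rejected up front, and the manifest is the part a reviewer or compliance system would read instead of the code.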
We're currently using a strictly rule-based approach injected at the system-prompt level rather than a secondary classifier. Since ConvertlyAI handles 10 different output types (from raw Twitter threads to SEO blogs), we found that explicitly banning specific behaviors (like markdown wrappers, fake metrics, or specific generic AI buzzwords like 'Delve') directly in the main "systemRole" for "gpt" keeps latency low while still preventing formatting breaks. It's essentially a massive 'Do Not Do This' list passed right before execution.
Your static analysis approach for catching type constraints before execution sounds significantly more robust for complex workflows, though! Is that adding much latency?