by VitalStack 83 days ago
The shared-state manifest pattern is the right call — we hit the same design question building VitalStack (supplement interaction MCP server, vital-stack.com). Ended up using Supabase instead of a file because our skills need live lookup, but the principle is identical: one source of truth that downstream skills read and annotate.

Your confidence tiers (validated/researched/assumed) resonate too. We distinguish PubMed RCT data from case reports from mechanistic inference. That ended up being our most important UX decision: users need to know why a result is flagged, not just that it is.

The anti-slop checks are clever. Does the swap test run at generation time or as a post-processing step? Curious whether you're prompting for it explicitly or checking output against a classifier.

1 comment

Great parallel with PubMed RCT vs case reports vs mechanistic inference, and it highlights the same core problem: users trusting output they shouldn't is worse than no output at all. The confidence tiers turned out to be satisfyingly efficient to implement (just YAML frontmatter fields that each skill reads and annotates) and they change how you interact with the results.
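To make the frontmatter idea concrete, here is a minimal sketch of how a downstream skill might read the tiers. The field names (`audience`, `positioning`, `tagline`) and the helpers `read_confidence` and `flag_weak` are hypothetical, made up for illustration; only the three tier names come from the thread. It uses a tiny hand-rolled `key: value` parser rather than a real YAML library, so it only handles flat fields.

```python
# Hypothetical manifest: field names are illustrative, not from the actual project.
MANIFEST = """\
---
audience: validated
positioning: researched
tagline: assumed
---
Brand notes go here.
"""

# The three tiers named in the thread.
TIERS = ("validated", "researched", "assumed")


def read_confidence(text):
    """Parse flat `key: value` frontmatter between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and value.strip() in TIERS:
            fields[key.strip()] = value.strip()
    return fields


def flag_weak(fields):
    """Return the fields a downstream skill should surface as unverified."""
    return sorted(k for k, v in fields.items() if v == "assumed")


confidence = read_confidence(MANIFEST)
print(flag_weak(confidence))
```

The point is only that the tier travels with the field, so any skill that reads the manifest can say why something is flagged instead of re-deriving it.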

On the Supabase vs file question: makes total sense for live lookup. The file-based manifest works here because brand building is user-driven and session-based. You're not querying supplement interactions in real time. But the principle is the same: one source of truth, downstream consumers read and annotate, never duplicate state.

About the swap test: it runs at generation time, inside the skill prompt. Each skill has instructions that say "before presenting output, run these checks."

If the swap test fails, the skill flags it inline, explains why it failed, and rewrites before the user ever sees the first version. No separate classifier or post-processing step.

Tradeoff: it's not as rigorous as a trained classifier.

For brand copy the swap test is simple: if a statement could apply equally well to any other brand or service, it fails. A classifier would be overkill here, though probably necessary for something like your PubMed evidence grading.
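To illustrate the criterion only (the real check lives in the skill prompt, not in code), here is a rough sketch of what "could apply to any brand" means. The function `swap_test`, the `specifics` list, and the example brand are all assumptions for illustration. The heuristic is deliberately naive: strip the brand name and see whether anything brand-specific is left.

```python
import re


def swap_test(statement, brand, specifics):
    """Return (passed, reason).

    Fails when nothing brand-specific remains once the brand name is removed,
    i.e. the sentence would read the same with any other brand swapped in.
    `specifics` is a hypothetical list of details unique to this brand.
    """
    swapped = re.sub(re.escape(brand), "", statement, flags=re.IGNORECASE)
    hits = [s for s in specifics if s.lower() in swapped.lower()]
    if hits:
        return True, "anchored by: " + ", ".join(hits)
    return False, "could describe any brand once the name is swapped"


# Hypothetical brand and details, purely for the example.
specifics = ["hand-welded", "Portland"]
print(swap_test("Acme delivers quality you can trust.", "Acme", specifics))
print(swap_test("Acme frames are hand-welded in Portland.", "Acme", specifics))
```

A substring match like this misses paraphrases, which is the rigor tradeoff mentioned above; the prompt-based version relies on the model's judgment instead.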