| Happy to explain how the scoring works, since that’s the obvious first question.

The core idea is:

Safety Score = 100 − riskScore

The risk score is based on structural prompt properties that tend to correlate with failures in production systems:

- instruction hierarchy ambiguity
- conflicting directives (system vs user)
- missing output constraints
- unconstrained response scope
- token cost / context pressure

Each factor contributes a weighted amount to the total risk score.

It’s not trying to predict exact model behavior; that’s not possible statically. The goal is closer to a linter: flagging prompt structures that are more likely to break (injection, hallucination drift, ignored constraints, etc.).

There’s also a lightweight pattern registry. If a prompt matches structural patterns seen in real jailbreak/injection cases (e.g. authority ambiguity), the risk score increases.

One thing that surprised me while building it:
instruction hierarchy ambiguity caused more real-world failures than obvious injection patterns.

The CLI runs locally; no prompts are sent anywhere.

If you want to try it:

    npm install -g @camj78/costguardai
    costguardai analyze your-prompt.txt

Curious what failure modes others here have seen in production prompts. |
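
For anyone who wants the weighted scoring above in concrete terms, here is a rough TypeScript sketch. The weights, the 0 to 1 severity scale, the `patternPenalty` parameter, and all function names are assumptions made up for illustration; they are not the values or code costguardai actually uses.

```typescript
// Hypothetical sketch: names, weights and the 0..1 severity scale are
// illustrative assumptions, not costguardai's real values.
interface RiskFactor {
  name: string;
  weight: number;   // max points this factor can add to the risk score
  severity: number; // 0 = not present, 1 = fully present
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Weighted sum of factor severities plus an optional pattern-registry
// penalty (assumed here to be a flat number of points), clamped to 0..100.
function riskScore(factors: RiskFactor[], patternPenalty: number = 0): number {
  const weighted = factors.reduce(
    (sum, f) => sum + f.weight * clamp(f.severity, 0, 1),
    0
  );
  return clamp(weighted + patternPenalty, 0, 100);
}

// Safety Score = 100 - riskScore
function safetyScore(factors: RiskFactor[], patternPenalty: number = 0): number {
  return 100 - riskScore(factors, patternPenalty);
}

const example: RiskFactor[] = [
  { name: "instruction hierarchy ambiguity", weight: 30, severity: 1 },
  { name: "conflicting directives", weight: 25, severity: 0 },
  { name: "missing output constraints", weight: 20, severity: 0.5 },
  { name: "unconstrained response scope", weight: 15, severity: 0 },
  { name: "token cost / context pressure", weight: 10, severity: 0.5 },
];

console.log(safetyScore(example)); // 100 - (30 + 10 + 5) = 55
```

Clamping keeps the result in the 0 to 100 range even if a pattern match pushes the raw sum past 100, so the safety score never goes negative.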