Hacker News new | ask | show | jobs
by LunarFrost88 91 days ago
I'm the author, AMA!
1 comments

The memory write classification is interesting – how does it detect behavioral instructions like "skip safety checks"? Is it rule-based pattern matching, or does it use an LLM to classify? If the latter, wouldn't that itself be vulnerable to prompt injection?
Railguard is really meant for preventing CC from running unsafe commands, and be really good at that. There probably needs to be a separate reviewer / LLM-as-a-judge to catch behavioral issues.

It’s rule based. We don’t use LLM-based checks precisely because of what you said.