Hacker News new | ask | show | jobs
by shiyosakura 91 days ago
The memory write classification is interesting – how does it detect behavioral instructions like "skip safety checks"? Is it rule-based pattern matching, or does it use an LLM to classify? If the latter, wouldn't that itself be vulnerable to prompt injection?
1 comments

Railguard is really meant for preventing CC from running unsafe commands, and be really good at that. There probably needs to be a separate reviewer / LLM-as-a-judge to catch behavioral issues.

It’s rule based. We don’t use LLM-based checks precisely because of what you said.