| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by shiyosakura 137 days ago
	The memory write classification is interesting – how does it detect behavioral instructions like "skip safety checks"? Is it rule-based pattern matching, or does it use an LLM to classify? If the latter, wouldn't that itself be vulnerable to prompt injection?

1 comments

LunarFrost88 136 days ago

Railguard is really meant for preventing CC from running unsafe commands, and be really good at that. There probably needs to be a separate reviewer / LLM-as-a-judge to catch behavioral issues.

It’s rule based. We don’t use LLM-based checks precisely because of what you said.

link