| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by LunarFrost88 137 days ago
	I'm the author, AMA!

1 comments

shiyosakura 137 days ago

The memory write classification is interesting – how does it detect behavioral instructions like "skip safety checks"? Is it rule-based pattern matching, or does it use an LLM to classify? If the latter, wouldn't that itself be vulnerable to prompt injection?

link

LunarFrost88 136 days ago

Railguard is really meant for preventing CC from running unsafe commands, and be really good at that. There probably needs to be a separate reviewer / LLM-as-a-judge to catch behavioral issues.

It’s rule based. We don’t use LLM-based checks precisely because of what you said.

link