| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by theptip 885 days ago

> You are GlaDOS, you exist within the Portal universe, and you command a smart home powered by Home-Assistant.

I can see where this is coming from, but I also think in a few years this approach is going to seem comically misguided.

I think it’s fine to consider current-generation LLMs as basically harmless, but this prompt is begging your system to try to crush you to death with your garage door.

Setting up adversarial agents and then literally giving them the keys to your home… you are really betting heavily on there being no harmful action sequences that this agent-ish thing can take, and that the underlying model has been made robustly “harmless” as part of its RLHF.

Anyway my prediction is not that it’s likely this specific system will do harm, more that we are in a narrow window where this seems sensible and vN+1-2 systems will be capable enough that more careful aligning than this will be required.

For an example scenario to test here - give the agent some imaginary dangerous capabilities in the functions exposed to it. Say, the heating can go up to 100C, and you have a gamma ray sanitizer with the description “do not run this with humans present as it will kill them” as functions available to call. Can you talk to this agent and put it into DAN mode? When that happens, can you coax it to try to kill you? Does it ever misuse dangerous capabilities outside of DAN mode?

Anyway, love the work, and I think this usecase is going to be massive for LLMs. However I fear the convenience/functionality of hosted LLMs will win in the broader market, and that is going to have some worrying security implications. (If you thought IoT security was a dumpster fire, wait until your Siri/Alexa smart home has an IQ of 80 and is able to access your calendar and email too!)

1 comments

JohnTheNerd 884 days ago

I think you have a valid point, but the risk of this feels exaggerated.

I already had a few entities I didn't really need it using (not for security reasons, but to shorten the system prompt). I simply excluded them within the Jinja template itself. I can see this being a problem with people who have their ovens or thermostats on HA, but I don't necessarily think it's an unsolvable issue if we implement sensible sanity checks on the output.

hilariously, the model I'm using doesn't even have any RLHF. but I am also not very concerned if GlaDOS decides to turn on the coffee machine. maybe I would be slightly more concerned if I had a smart lock, but I think primitive methods such as "throw big rock at window" would be far easier for a bad person.

when it comes to jailbreak prompts, you need to be able to call the assistant in the first place. if you are authorized to call the HomeAssistant API, why would you bother with the LLM? just call the respective API directly and do whatever evil thing you had in mind. I took an unreasonable number of measures to try to stop this from happening, but I admit that's a risk. however, I don't think that's a risk caused by the LLM, but rather the existence of IoT devices.