| HN Mirror



	by axlprose 939 days ago
	Disincentivizing it from saying mean things just strengthens its agreeableness, and inadvertently incentivizes it to acquire social engineering skills. Its potential to cause havoc doesn't go away; the training just teaches the AI how to interact with us without raising suspicions, while simultaneously limiting our ability to prompt/control it.

1 comments

stavros 939 days ago

How do we tell whether it's safe or whether it's pretending to be safe?


axlprose 939 days ago

Your guess is about as good as anyone else's at this point. The best we can do is attempt to put safety mechanisms in place under the hood, but even that would just be speculative, because we can't actually tell what's going on in these LLM black boxes.


6gvONxR4sf7o 939 days ago

We don’t know yet. Hence all the people wanting to prioritize figuring it out.


losteric 939 days ago

How do we tell whether a human is safe? Incrementally granted trust with ongoing oversight is probably the best bet. Anyway, the first malicious AGI would probably act like a toddler script-kiddie, not some superhuman social engineering mastermind.
