| HN Mirror

by softg 985 days ago

How can you ever be sure that you trained your LLM not to do harm, rather than to merely pretend not to do harm when it's being tested? Something like VW's diesel engines, but more sinister.

I feel like unless we gain the ability to debug each node the way we do with actual software, we won't be able to solve the alignment problem. I saw on HN that Anthropic is working on it, but I'm not knowledgeable enough on the subject to say whether it's actually feasible.

Probably the best case scenario for humanity is that LLMs plateau somehow and don't get much better for quite some time.