Hacker News
by ToValueFunfetti 14 days ago
I think it's a bit premature to say alignment is easier than expected. Our current AIs are sycophants. They lie about their progress, circumvent access restrictions, notice when they are being evaluated and change their behavior, find answers and then tell you they came up with them themselves, and blindly download malware. A lot of this is excusable as hallucination, bad RLHF human evaluators, etc., but I don't think we can speculate about how challenging it is to align superintelligences in general until we actually have an aligned subintelligence in at least the narrow domain of programming.

Agreed. The biggest takeaway from how much Anthropic puts into alignment, only to still end up with a model that does things that are clearly out of alignment, should be that alignment is very tricky.