Hacker News new | ask | show | jobs
by snyhlxde 496 days ago
It's funny that reasoning models sometime speaking nonsense and perform worse than well-aligned models like claude-3.5-sonnet in multi-turn games like Akinator. I think it's one current weak point of applying longCoT RL vs. instruction-following alignment. Maybe we need to find a way to address both? Would be interesting to see some results