
This is considerably weirder than ensembling because they are not, in any sense, averaging the predictions or taking a majority vote or in some way collectively processing multiple models to yield a single slightly-better meta-model. They are just... randomly picking a model to use to generate the next response. And users find that more entertaining to talk to?

As there is no analysis of why that is better, and no evaluation of alternative approaches (what if you alternated A/B/A/B? Or cycled through them systematically, A/B/C/A? Or picked a different shuffle of A/B/C each time?), it's hard to say what this means.
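To make those alternatives concrete, here is a rough sketch of the three selection schedules one could compare: uniform random choice per turn, a fixed round-robin cycle, and a fresh shuffle of the full set each cycle. Everything here is hypothetical illustration (the function names, the `Model` type, the `chat` helper); nothing is taken from the paper itself.

```python
import itertools
import random
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
# Hypothetical stand-in for a chat model: takes a prompt, returns a reply.
Model = Callable[[str], str]


def pick_random(models: Sequence[T], rng: random.Random) -> Iterator[T]:
    """Pick one model uniformly at random for every turn."""
    while True:
        yield rng.choice(models)


def pick_round_robin(models: Sequence[T]) -> Iterator[T]:
    """Cycle through the models in a fixed order: A, B, C, A, ..."""
    return itertools.cycle(models)


def pick_shuffled_cycles(models: Sequence[T], rng: random.Random) -> Iterator[T]:
    """Use every model once per cycle, in a freshly shuffled order each cycle."""
    while True:
        order = list(models)
        rng.shuffle(order)
        yield from order


def chat(prompts: Sequence[str], schedule: Iterator[Model]) -> list[str]:
    """Answer each prompt with whichever model the schedule yields next."""
    return [next(schedule)(prompt) for prompt in prompts]
```

Swapping the schedule passed to `chat` is the only change needed to compare the strategies, so an A/B test over them would be cheap to set up.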

My best guess is that this reflects the fact that GPT is, thanks to RLHF, boring. It suffers from mode collapse and does things like tell one of a handful of jokes every time. It will write a rhyming poem even if you ask it for a different kind of poem. And so on.

The random sampling of different models serves as a rather ad hoc way of avoiding the RLHF boringness. The various models might all be tuned similarly, but they won't yield identical results, and this sneaks in response diversity through the back door, undoing the sameness from the RLHF mode collapse.

You used to be able to increase the sampling temperature on GPT to undo some of this blandness, but since RLHF flattens the logits in GPT-4, it's unclear if that still helps. So swapping in random models may be a useful trick. (Although fixing the tuning itself would be much more desirable.)
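For anyone unfamiliar with what the temperature knob does, here is a minimal sketch of temperature-scaled softmax over a made-up logit vector. The logit values are invented for illustration and do not come from any real model; the point is only that dividing logits by a temperature above 1 spreads probability mass out (higher entropy), while a temperature below 1 concentrates it.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Invented logits for four candidate tokens (illustration only).
example_logits = [4.0, 1.0, 0.5, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

If the tuned model's distribution is already heavily concentrated on one continuation, a modest temperature increase may change the samples very little, which is one way to read the doubt expressed above.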