Hacker News
by Animats 893 days ago
"Responses are selected randomly from a group of base chat AIs. ... The response generated by a specific chat AI is conditional on all previous responses generated by the previously selected chat AIs."

That's all? That works? Useful.

Could that be extended? It doesn't seem inherent in this that all the chat AIs have to be LLMs. Some might be special-purpose systems. Solvers or knowledge bases, such as Wolfram Alpha or a database front end, could play too. Systems at the Alexa/Siri level that can do simple tasks. Domain-specific systems with natural language in and out have been around for decades.
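
A rough sketch of what that could look like, assuming nothing beyond the quoted description: each turn, pick one responder at random and hand it the whole shared history. The `calculator` and `echo_bot` functions are made-up stand-ins (not Wolfram Alpha or any real model API), and `blended_reply` is a hypothetical name.

```python
import random
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]                        # (speaker, text)
# A responder is anything that maps the shared history to a reply string.
Responder = Callable[[Sequence[Turn]], str]


def calculator(history: Sequence[Turn]) -> str:
    """Stand-in for a solver-style system; handles 'a + b', 'a - b', 'a * b'."""
    try:
        a, op, b = history[-1][1].split()
        x, y = float(a), float(b)
        return str({"+": x + y, "-": x - y, "*": x * y}[op])
    except (ValueError, KeyError):
        return "I can only do 'a + b', 'a - b' or 'a * b'."


def echo_bot(history: Sequence[Turn]) -> str:
    """Stand-in for a chat model; it only reports how much context it saw."""
    return f"(chat model, saw {len(history)} earlier turns)"


def blended_reply(history: List[Turn],
                  responders: Sequence[Responder],
                  rng: random.Random) -> str:
    """Pick one responder at random, condition it on the full shared history,
    and append its reply so that later responders see it too."""
    responder = rng.choice(responders)
    reply = responder(history)
    history.append(("assistant", reply))
    return reply
```

Nothing in `blended_reply` cares whether a responder is an LLM; it only needs text in and text out, which is the point of the question above.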

2 comments

Why aren't they computing the next-token marginal and sampling from that? The only reason I can come up with is that picking a whole model per response is a reasonable way to avoid dealing with different tokenizers.
Seems a lot like ensembling methods for traditional predictive models.
This is considerably weirder than ensembling because they are not, in any sense, averaging the predictions or taking a majority vote or in some way collectively processing multiple models to yield a single slightly-better meta-model. They are just... randomly picking a model to use to generate the next response. And users find that more entertaining to talk to?
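
For contrast, here is a toy sketch of the token-level alternative raised upthread: mix the models' next-token distributions and sample from the mixture. It assumes all models share one vocabulary, which is exactly what differing tokenizers break. The distributions are made-up numbers, and `marginal` and `sample` are illustrative names.

```python
import random
from typing import Dict, List

Dist = Dict[str, float]   # token -> probability, over one shared vocabulary


def marginal(dists: List[Dist]) -> Dist:
    """Uniform mixture of next-token distributions (missing tokens count as 0)."""
    vocab = set().union(*dists)
    return {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token in proportion to its probability."""
    tokens = sorted(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Made-up next-token distributions from two hypothetical models.
model_a = {"yes": 0.9, "no": 0.1}
model_b = {"yes": 0.2, "no": 0.3, "maybe": 0.5}
mixed = marginal([model_a, model_b])
next_token = sample(mixed, random.Random(0))
```

For a single token, sampling from this uniform mixture is the same as picking a model at random and sampling from it. The difference is granularity: the mixture re-rolls the choice at every token, while the scheme in the paper holds one model for an entire response.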

As there is no analysis of why that is better, nor any evaluation of alternative approaches (what if you alternated A/B/A/B, cycled through them systematically A/B/C/A, or picked a different shuffle of A/B/C each time?), it's hard to say what this means.
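
The schedules in question are cheap to write down, which makes the missing comparison more noticeable. A sketch, with hypothetical function names, of the uniform random pick described in the paper next to the alternatives:

```python
import itertools
import random
from typing import Iterator, Sequence


def uniform_random(models: Sequence[str], rng: random.Random) -> Iterator[str]:
    """Independent uniform pick each turn (the scheme as described)."""
    while True:
        yield rng.choice(models)


def round_robin(models: Sequence[str]) -> Iterator[str]:
    """Fixed cycle: A/B/A/B for two models, A/B/C/A for three."""
    return itertools.cycle(models)


def shuffled_rounds(models: Sequence[str], rng: random.Random) -> Iterator[str]:
    """A fresh shuffle of all models each round, so every model speaks
    exactly once per round."""
    while True:
        order = list(models)
        rng.shuffle(order)
        yield from order
```

Each generator yields which model answers the next turn, so swapping the schedule would be a one-line change in an experiment.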

My best guess is that this reflects the fact that GPT is, thanks to RLHF, boring. It suffers from mode collapse and does things like tell one of a handful of jokes every time. It will write a rhyming poem even if you ask it for a different kind of poem. And so on.

The random sampling of different models serves as a rather ad hoc way of avoiding the RLHF boringness. The various models might all be tuned similarly, but they won't yield identical results, and this sneaks in response diversity through the backdoor, undoing the sameness from the RLHF mode collapse.

You used to be able to increase the sampling temperature on GPT to undo some of this blandness, but since RLHF flattens the logits in GPT-4, it's unclear if that still helps. So swapping in random models may be a useful trick. (Although fixing the tuning itself would be much more desirable.)

Not quite. The outputs of the models become part of the prompt history of all the models. So they can assist each other. For example, one model might produce a highly technical reply. Then the user can ask for explanations, to be provided by a model with better language capability but less domain expertise.