Hacker News
by abeppu 893 days ago
Ok, this seems bunk basically because they never really provide evidence of "better".

> ... traditional gold-standard approaches use human evaluators that score the quality of generated responses, which can be costly. However, since chat AIs are by definition deployed in social environments with humans, one can leverage statistics of users interaction as a meaningful and aligned measure of chat AI engagingness and quality. To assess the ’quality’ of a chat AI, we consider two main proxy functions: the industry standard user retention and the main objective function, user engagement.

Maybe retention and engagement _are_ sufficiently well correlated to human evaluations, but you should probably do both and show that they're strongly correlated before you decide to just drop the human evaluators in favor of your cheap proxy measurements.
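
To make that check concrete, here is a minimal sketch of what "show that they're strongly correlated" could look like, assuming you have one human-evaluation score and one retention figure per model variant. The function names and the numbers are made-up placeholders for illustration, not anything from the paper. It computes a Spearman rank correlation, since the question is mostly whether the proxy orders the models the same way the human raters do.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-model numbers, invented purely for illustration.
human_scores = [3.1, 3.8, 2.5, 4.2, 3.5]     # mean human rating per model
retention = [0.22, 0.30, 0.18, 0.27, 0.29]   # retention rate per model

print(f"Spearman rho = {spearman(human_scores, retention):.2f}")
```

A high rho on real data would be some evidence that the cheap proxy tracks human judgment; a low one would mean that "engaging" and "good" are measuring different things. With only a handful of models the estimate is very noisy, so you would also want a confidence interval before leaning on it.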

And in this field, where there are some known issues with chat LLMs, perhaps it's important to check stuff like:

- Does the model seem "engaging" just b/c the user has to refine their prompt several times before they get a satisfying response?

- Do responses include a lot of hallucinations which might be engaging but not true?

- Do successive responses show decreased consistency or coherence between messages, in a way that might accidentally elicit continued engagement?

Overall, it seems sloppy to believe that it's not a waste of humans' time to talk to your chatbots, and it's not a waste of time for readers to look at this paper about your chatbots, but it's too expensive for you to actually measure the quality of responses from your chatbots.


They're making chatbots specifically for humans to waste time with them (a.k.a. entertainment.)

Engagement and user retention are directly connected to their bottom line in a way that quality responses (e.g. introducing you to a more fulfilling hobby than chatting with AIs) are not.

That is what I read in this paper as well. It is not about "better as better performance"; it is "better as improved user retention".
The paper's title makes a stronger claim than the abstract. The abstract makes a stronger claim than the paper. It is like they couldn't decide what paper to write.

Edit: Thinking about it, this is exactly what you might expect from a paper written by a stochastic mix of experts.

Optimizing AIs to be addictive to humans is always how humanity will end. It was the natural end to social media, and market forces will force the same to happen in this industry.

People worry about the robot uprising killing all humans but never think about the far more likely AI domestication of humans.

This criticism seems out of touch.

They are presenting a real-world use case where retention and engagement are clearly the metrics of interest. It's not even clear what "human evaluations" would mean in this context.

Kudos for not falling into the benchmark / human eval trap, and just testing your theories directly at scale in a deployment setting.