by acc_297 393 days ago
The last graph is the most telling evidence that our current "general" models are pretty bad at any specific task: all models tested are 15% more likely to pick the candidate presented first in the prompt, all else being equal.
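
For anyone who wants to check this kind of order effect themselves, here is a minimal sketch, assuming the decision-maker can be wrapped in a function that takes two candidates and returns the index (0 or 1) of its pick. The names `first_position_rate`, `always_first`, and `pick_higher_score` and the toy candidates are made up for illustration; they are not from the article. Every pair is shown in both orders, so an order-insensitive chooser should land at exactly 0.5.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, int]  # (name, hypothetical merit score)

def first_position_rate(
    pairs: Sequence[Tuple[Candidate, Candidate]],
    choose: Callable[[Tuple[Candidate, Candidate]], int],
) -> float:
    """Fraction of trials in which `choose` picks the candidate shown first.

    Each pair is presented in both orders, so a chooser that ignores
    presentation order scores exactly 0.5.
    """
    first_picks = 0
    trials = 0
    for a, b in pairs:
        for shown in ((a, b), (b, a)):
            if choose(shown) == 0:  # index 0 means "picked the first one shown"
                first_picks += 1
            trials += 1
    return first_picks / trials

# Two stand-in choosers (assumptions; replace with a real model call).
def always_first(shown: Tuple[Candidate, Candidate]) -> int:
    return 0

def pick_higher_score(shown: Tuple[Candidate, Candidate]) -> int:
    return 0 if shown[0][1] >= shown[1][1] else 1

pairs = [(("Ana", 3), ("Ben", 5)), (("Cy", 9), ("Di", 2))]
print(first_position_rate(pairs, always_first))       # 1.0
print(first_position_rate(pairs, pick_higher_score))  # 0.5
```

Anything noticeably above 0.5 on a large set of pairs would suggest the chooser is leaning on position rather than content.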

This quote sums it up perfectly. The worst part is not the bias; it's the false articulation of a grounded decision.

"In this context, LLMs do not appear to act rationally. Instead, they generate articulate responses that may superficially seem logically sound but ultimately lack grounding in principled reasoning."

I know some smart people who are convinced by LLM outputs in the way they can be convinced by a knowledgeable colleague.

The model is usually good about showing its work, but this should be thought of as an over-fitting problem, especially if the prompt requested that a subjective decision be made.

People need to realize that the current LLM interfaces will always sound incredibly reasonable, even if the policy prescription they select was a coin toss.


I don't think that LLMs at present are anything resembling human intelligence.

That said, the order in which candidates are presented will also psychologically influence a human's final decision.

Last time this happened to someone I know, I pointed out they seemed to be picking the first choice every time.

They said, “Certainly! You’re right I’ve been picking the first choice every time due to biased thinking. I should’ve picked the first choice instead.”

It's worse than this. It doesn't matter if a human understands recency bias, the availability heuristic or the halo effect.

It will still change the decision. It doesn't matter if you "understand" these concepts or not, or if you use some other bias or heuristic to overcorrect for the previous bias or heuristic you think you understand.

On this topic, I think people tend to confuse outright discrimination with the much more subtle biases and heuristics a human uses for judgement under uncertainty.

The interview process really shows how much closer we are to medieval people than what we believe ourselves to be.

Picking a candidate based on the patterns of chicken guts wouldn't be much less random and might even be more fair.

If all else is truly equal, there's no reason not to just pick the first. It's an arbitrary decision anyway.

I suspect humans are much more influenced by recency bias, though.

For example, if you have 100 resumes to go through, are you likely to pick one of the first ones?

Maybe, if you just don't want to go through all 100

But if you do go through all 100, I suspect that most of the resumes you select are near the end of the stack of resumes

Because you won't really remember much about the ones you looked at earlier unless they really impressed you

Which is why, if you have a task like that, you're going to want to use a technique other than going straight down the list if you care about the accuracy of the results.

Pairwise comparison is usually the best but time-consuming; keeping a running log of ratings can help counteract the recency bias, etc.
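
To make the pairwise idea concrete, here is a rough sketch, assuming you have some `prefer(x, y)` judgment that returns whichever of two resumes it likes better. The resume names, the `scores` dict, and `rank_by_pairwise` are hypothetical stand-ins, not anyone's actual process. Each pair is shown in a random order so that neither slot is systematically favored, and the final ranking is just a count of wins.

```python
import itertools
import random
from collections import Counter
from typing import Callable, List

def rank_by_pairwise(
    items: List[str],
    prefer: Callable[[str, str], str],
    seed: int = 0,
) -> List[str]:
    """Rank items by number of wins over a full round of pairwise comparisons."""
    rng = random.Random(seed)
    wins = Counter({item: 0 for item in items})
    for a, b in itertools.combinations(items, 2):
        pair = [a, b]
        rng.shuffle(pair)  # randomize which one is presented first
        wins[prefer(pair[0], pair[1])] += 1
    # sorted() is stable, so ties keep their original relative order
    return sorted(items, key=lambda item: wins[item], reverse=True)

# Hypothetical reviewer: prefers the resume with the higher made-up score.
scores = {"r1": 2, "r2": 9, "r3": 5, "r4": 7}

def prefer(x: str, y: str) -> str:
    return x if scores[x] >= scores[y] else y

print(rank_by_pairwise(list(scores), prefer))  # ['r2', 'r4', 'r3', 'r1']
```

The "time-consuming" part is the number of comparisons: n items need n(n-1)/2 of them, so 100 resumes means 4,950 pairs.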

I think any time people say that LLMs have this flaw or another, they should also discuss whether humans have the same flaw.

We _know_ that the hiring process is full of biases and mistakes and people making decisions for non-rational reasons. Is an LLM more or less biased than a typical human-based process?

> Is an LLM more or less biased than a typical human-based process?

Being biased isn't really the problem

Being able to identify the bias so we can control for it and introduce process to manage it, that's the problem

We have quite a lot of experience with identifying and controlling for human bias at this point and almost zero with identifying and controlling for LLM bias

Thank you for saying this, I agree with your point exactly.

However, instead of using that known human bias to justify pervasive LLM use, which will scale and make everything worse, we should either improve LLMs, improve humans, or do some combination of both.

Your point is a good one, but the conclusion often taken from it is a selfish shortcut, biased toward just throwing up our hands and saying "haha humans suck too, am I right?" instead of substantial discussion or effort toward actually improving the situation.

Human HR staff get training specifically for bias and are at least aware they probably have racial and sexual biases. Even you and I get this training when we start at a company.

I recently used Gemini's Deep Research function for a literature review of color theory with regard to educational materials like PowerPoint slides. I did specifically mention Mayer's Multimedia Learning work [1].

It did a fairly decent job of finding source material that supported what I was looking for. However, I will say that it tailored some of the terminology a little TOO much to Mayer's work. It didn't start to use terms from cognitive load theory until later in its literature review, which was a little annoying.

We're still in the initial stages of figuring out how to interact with LLMs, but I am glad that one of the underpinning mentalities is essentially "don't believe everything you read" and "do your own research". It doesn't solve the more general attention problem (people will seek out information that reinforces their opinions), but Gemini did provide me with a good starting point for research.

[1] https://psycnet.apa.org/record/2015-00153-001

Until very recently, it was basically impossible to sound articulate while being incompetent. We have to adjust.

Yeah this. In the UK we have a real problem with completely unearned authority given to people who went to prestigious private schools.

I've seen it a few times: otherwise shrewd colleagues interpreting the combination of accent and manner learned in elite schools as a sign of intelligence. A technical test tends to pierce the veil.

LLMs give that same power to any written voice!

> until very recently, it was basically impossible to sound articulate while being incompetent. We have to adjust.

My observation differs: for very likely centuries, we had/have these people who by their articulateness could "bullshit" in a lot of topics where their knowledge is very shallow. Only experts could recognize the difference (but "nobody" listened/listens to those); the mass of people (including a lot of those in power) fell/falls for these articulate pseudo-"experts".

By the existence of LLMs, a lot of people simply became aware of this centuries-old phenomenon (or to put it more colloquially: LLMs brought "articulate bullshit as a service" to the masses :-) ).

Yes, this was a great article. We need more of this independent research into LLM quirks & biases. It's all too easy to whip up an eval suite that looks good on the surface, without realizing that something as simple as list order can swing the results wildly.
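
One cheap sanity check along these lines, sketched under the assumption that the thing being evaluated can be called as `pick(options)` and returns one of the options: reshuffle the same list many times and tally what gets picked. The function names and the toy options here are invented for illustration. If the pick tracks content, the tally concentrates on one option; if it tracks position, the tally smears across the list.

```python
import random
from collections import Counter
from typing import Callable, List

def order_sensitivity(
    options: List[str],
    pick: Callable[[List[str]], str],
    trials: int = 100,
    seed: int = 0,
) -> Counter:
    """Tally which option `pick` selects across many random orderings."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(trials):
        shuffled = options[:]  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        counts[pick(shuffled)] += 1
    return counts

# Stand-in pickers (assumptions; swap in a real model call here).
def pick_by_content(opts: List[str]) -> str:
    return max(opts)  # depends only on the content of the options

def pick_by_position(opts: List[str]) -> str:
    return opts[0]  # depends only on presentation order

options = ["a", "b", "c"]
print(order_sensitivity(options, pick_by_content))   # all 100 picks go to "c"
print(order_sensitivity(options, pick_by_position))  # picks spread across options
```

An eval that only ever runs one ordering cannot tell these two pickers apart, which is the trap described above.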

> I know some smart people who are convinced by LLM outputs in the way they can be convinced by a knowledgeable colleague.

I wonder if that is correlated to high "consumption" of "content" from influencer types...

But this makes sense, since humans are biased towards, e.g., picking the first option from a list. If an LLM was trained on this data, it makes sense for the model to be biased like the humans who produced the training data.