Hacker News
by donkeyboy 1072 days ago
There is a paper on arXiv saying that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
7 comments

You’re missing the point here. It’s not even getting the LLM’s opinion on the responses to the prompts (which is itself fraught for some tasks, and benchmarks are known to be limited; even OpenAI admits this, which is why they made evals). It’s one level abstracted from that. It’s evaluating what the LLM thinks of how well the prompt will do, in purely hypothetical terms. That’s hogwash: different LLMs perform very differently even for the same prompts. Try any tool that lets you compare model responses side-by-side. Unless I see actual use cases, this is yet another iteration of overtrusting AI.

Here is what HN was talking about nearly three months ago, the exact same type of ‘auto-prompt-gen’ tool: https://news.ycombinator.com/item?id=35660751

> Here is what HN was talking about nearly three months ago, the exact same type of ‘auto-prompt-gen’ tool.

I was reminded of the same thing. What a lot of it boils down to is that LLMs have no innate ability to self-reflect. They can pretend to do it, but no more effectively than an untrained human would.

> They can pretend to do it, but no more effectively than an untrained human would.

Which is exactly as much as Generative AI should be trusted.

Of course there is strong correlation. That is literally what it was designed to do.

The problem is that it will simultaneously say that "cow eggs are bigger than chicken eggs", with the same confidence (and in a way that correlates well with human evaluators).

https://www.reddit.com/r/Funnymemes/comments/10ohd2n/chatgpt...

So when you get an evaluation you are playing Russian roulette: you may get a decent result, or you may get cow eggs.

I just asked and it told me cows are mammals and do not lay eggs. That Reddit post is not even GPT-4 and is 5 months old, which may as well be the 19th century on AI tech timescales.

The post you're replying to is a case in point. This time it's cow eggs; what next?

Some people believe the earth is flat, but they can still provide useful work.

These people typically subscribe to a very limited number of conspiracy theories.

You are concentrating on the details and avoiding the point.

The point is that the tool fails, and it is known to fail, so much so that we even have a name for the times when it fails: hallucinations. I have been calling them cow eggs because that's a nice mental image and I didn't want to have to remember the proper English term. I will continue calling them cow eggs.

That's definitely begging the question.

If you're prepared to accept that GPT-4 can answer questions just as well as humans can, why do you even need to do prompt engineering?

Humans still need 'prompt engineering' to answer questions more accurately though.

* What's the best way to get to Radio Shack from here?

is not the same as

* What's the easiest way to get to Radio Shack from memory when riding a bicycle from here?

Easiest way to get to Radio Shack on a bicycle is to ride that bike down to Doc Brown’s house, charge the DeLorean up to 1.21 gigawatts, and go back in time.

Humans benefit from good communication too. For example, annual U.S. deaths from medical errors are in the hundreds of thousands. Much of it is due to miscommunication. Is this akin to poor human-to-human prompt engineering? Of course, humans will rush and not attempt better communication, and you can take all the time you wish with an AI. And AI will continue to incorporate better prompt engineering that you won't have to write out. But there will always be a continuum from good to bad for communication and its outcomes.

You're forgetting that what you may consider to be factual, self-evident and a priori is your opinion.

You may be under the impression that annual U.S. deaths from medical errors being in the hundreds of thousands miscommunicates but that is truly your opinion. You are merely jumping to conclusions at places another person might not.

And going on to rely on the LLM to validate your perspective is a lossy process. It may not lose your perspective but it loses someone else's and you don't even seem to notice or care.

This is an excellent example.

The post you replied to was saying that the deaths were caused by miscommunication, but you interpreted it to mean that stating the number of such deaths is somehow a miscommunication itself!

Doesn't help that I had just recently woken up but yes, most definitely.

If we could use GPT-4 to grade prompts, we wouldn't need to be talking about grading prompts to use for GPT-4, since this solution requires that the problem doesn't exist. The question then becomes: how do you grade the prompt grading objectively? At the bottom, there has to be a ground truth.

You can't use the thing you're testing to evaluate its own performance. This applies to rulers, speedometers, and AI. It's the difference between "subjective" and "objective" metrics. If you want an objective metric, you need to base it on something external, grounded in reality. Otherwise, you have metrics and ideas that have to hold themselves up.

Source: My day job is test and measurement. These concepts go back centuries. You never trust your measurement system, you verify it against a standard.
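To make the "verify it against a standard" point concrete, here is a minimal sketch in Python. The function name `check_against_standard`, the readings, and the tolerance are all made up for illustration; it is not a real calibration procedure. It compares an instrument's readings with an external reference, then shows what happens when the instrument is checked against itself.

```python
def check_against_standard(readings, reference, tolerance):
    """Compare instrument readings with reference values (same units).

    Hypothetical helper for illustration only.
    """
    errors = [r - ref for r, ref in zip(readings, reference)]
    bias = sum(errors) / len(errors)          # average signed error
    worst = max(abs(e) for e in errors)       # largest absolute error
    return {
        "bias": bias,
        "worst_error": worst,
        "within_tolerance": worst <= tolerance,
    }


# Made-up data: gauge blocks of known length (mm) and what the ruler reads.
reference = [100, 200, 300]
readings = [101, 202, 303]

# Checked against an external standard, the drift shows up.
external = check_against_standard(readings, reference, tolerance=2)

# Checked against itself, the ruler is always "perfect".
self_check = check_against_standard(readings, readings, tolerance=2)

print(external)
print(self_check)
```

The self-check passes by construction, which is why it carries no information. The same goes for any grader scored only by its own output.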

I think the parent poster is saying that it’s grading the prompts and not the output generated from the prompts.

Yeah I agree there. Unless you can check against the output, it’s not really telling much.

> There is a paper on arXiv saying that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive

Could you post the link, please?

> correlation ... strongly positive.

A positive correlation just means better than chance. "Strongly" is vague, and might not be much better than chance.
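To put rough numbers on "positive" versus "strongly positive", here is a small sketch that computes Pearson's r by hand. The score lists `human`, `grader_a`, and `grader_b` are invented for illustration and do not come from any paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for eight items (higher is better).
human = [1, 2, 3, 4, 5, 6, 7, 8]
grader_a = [2, 1, 4, 3, 6, 5, 8, 7]   # only swaps neighbouring items
grader_b = [5, 1, 6, 2, 8, 3, 4, 7]   # agrees only loosely

r_a = pearson(human, grader_a)
r_b = pearson(human, grader_b)

# r squared is the share of variance in one score explained by the other.
print(f"grader_a: r = {r_a:.2f}, r^2 = {r_a ** 2:.2f}")
print(f"grader_b: r = {r_b:.2f}, r^2 = {r_b ** 2:.2f}")
```

Both graders correlate positively with the human scores, but one comes out at about 0.90 and the other at about 0.31. So the word "positive" alone says little. Without the actual coefficient, "strongly" is hard to interpret.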

> A positive correlation just means better than chance. "Strongly" is vague, and might not be much better than chance.

No, adverbs like “strongly” modify adjectives (or verbs, but that’s not relevant here), not nouns; “strongly” is an intensifier that modifies “positive”; it’s not a separate adjective that modifies the noun “correlation”.