There is a paper on arXiv saying that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
You’re missing the point here. It’s not even getting the LLM’s opinion on evaluating the responses to the prompts (which itself is fraught for some tasks, and benchmarks are known to be limited; even OpenAI admits this, it’s why they made evals). It’s one level abstracted from that. It’s evaluating what the LLM thinks of how well the prompt will do, in purely hypothetical terms. That’s hogwash: different LLMs perform very differently even for the same prompts. Try any tool that lets you compare model responses side-by-side. Unless I see actual use cases, this is yet another iteration of overtrusting AI.
> Here is what HN was talking about, nearly three months ago - the exact same type of ‘auto-prompt-gen’ tool.
I was reminded of the same thing. What a lot of it boils down to is that LLMs have no innate ability to self-reflect. They can pretend to do it, but no more effectively than an untrained human would.
Of course there is strong correlation. That is literally what it was designed to do.
The problem is that it will simultaneously say that "cow eggs are bigger than chicken eggs", with the same confidence (and in a way that correlates well with human evaluators).
I just asked and it told me cows are mammals and do not lay eggs. That Reddit post is not even GPT-4 and is 5 months old, which may as well be the 19th century on AI tech timescales.
You are concentrating on the details and avoiding the point.
The point is that the tool fails, and it is known to fail, so often that we even have a name for the times when it fails - hallucinations. I have been calling them cow eggs because that's a nice mental image and I didn't want to have to remember the proper English term. I will continue calling them cow eggs.
Easiest way to get to Radio Shack on a bicycle is to ride that bike down to Doc Brown’s house, charge the Delorean up to 1.21 gigawatts, and go back in time.
Humans benefit from good communication too. For example, annual U.S. deaths from medical errors are in the hundreds of thousands. Many of them are due to miscommunication. Is this akin to poor human-to-human prompt engineering? Of course, humans will rush and not attempt better communication, and you can take all the time you wish with an AI. And AI will continue to incorporate better prompt engineering that you won't have to write out. But there will always be a continuum from good to bad for communication, and communication outcomes.
You're forgetting that what you may consider to be factual, self-evident and a priori is your opinion.
You may be under the impression that annual U.S. deaths from medical errors being in the hundreds of thousands miscommunicates but that is truly your opinion. You are merely jumping to conclusions at places another person might not.
And going on to rely on the LLM to validate your perspective is a lossy process. It may not lose your perspective but it loses someone else's and you don't even seem to notice or care.
The post you replied to was saying that the deaths were caused by miscommunication, but you interpreted it to mean that stating the number of such deaths is somehow a miscommunication itself!
If we could use GPT-4 to grade prompts, we wouldn't need to be talking about grading prompts to use for GPT-4, since this solution requires that the problem doesn't exist. The question then becomes, how do you grade the prompt grading, objectively? At the bottom, there has to be a ground truth.
You can't use the thing you're testing to evaluate its own performance. This applies to rulers, speedometers, and AI. It's the difference between "subjective" and "objective" metrics. If you want an objective metric, you need to have it based on something external, based on reality, objective. Otherwise, you have metrics and ideas that have to hold themselves up.
Source: My day job is test and measurement. These concepts go back centuries. You never trust your measurement system, you verify it against a standard.
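For what it's worth, here is a rough Python sketch of what "verify it against a standard" could look like for an LLM grader. The scores are made up purely for illustration, and `pearson`, `human_scores` and `grader_scores` are hypothetical names of my own, not anything from the paper or the tool being discussed. It only assumes you have human ratings and grader ratings for the same set of responses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up ratings for the same six responses.
human_scores = [1, 2, 3, 4, 5, 6]   # the external standard (human raters)
grader_scores = [2, 1, 4, 3, 6, 5]  # the grader being verified

# Agreement between the grader and the external standard.
r_vs_standard = pearson(human_scores, grader_scores)

# Checking the grader against itself always gives perfect agreement,
# which says nothing about whether it is right.
r_vs_itself = pearson(grader_scores, grader_scores)

print(f"grader vs. humans: {r_vs_standard:.3f}")
print(f"grader vs. itself: {r_vs_itself:.3f}")
```

The second number is the point: a measurement compared with itself agrees perfectly, so only the comparison against the outside reference tells you anything. And a correlation coefficient alone still doesn't tell you which individual grades are wrong.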
> A positive correlation just means better than chance. "Strongly" is vague, and might not be much better than chance.
No, adverbs like “strongly” modify adjectives (or verbs, but that’s not relevant here), not nouns; “strongly” is an intensifier that modifies “positive”, it’s not a separate adjective that modifies the noun “correlation”.
Here is what HN was talking about, nearly three months ago - the exact same type of ‘auto-prompt-gen’ tool: https://news.ycombinator.com/item?id=35660751