|
|
|
|
|
by code51
1103 days ago
|
|
This "GPT4 evaluating LLMs" problem is not limited to this case. I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day. Couple this with the reliance on crowd-sourcing to create evaluation datasets and heavy use of GPT3.5 and GPT4 by MTurk workers, you have a big fat feed-forward process benefiting only one party: OpenAI. The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out. Reddit, Twitter and the like are awakening just now - to find that they're basically powerless against this wave of distorted future standards. When sufficiently proven to pass every existing test on Earth, every institution would be so reliant on producing work with GPT that we won't have a "%100 handmade exam" anymore. No problem will be left for GPT to be tackled with. |
|
Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it.
Let me not bitch again about the complete lack of anything like objective measures of success in language modelling, in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught at students in NLP courses (our tutors discussed it in my MSc course) there is scholarship on it, there is a constant chorus of "we have no idea what we're doing" but nothing changes. It's too much hard work to try and find good metrics, build good benchmarks. It's much easier to put a paper on arxiv that shows SOTA results (0.01 more than the best system compared to!). And so the house of cards rises ever towards the sky.
Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:
What Will it Take to Fix Benchmarking in Natural Language Understanding?
https://aclanthology.org/2021.naacl-main.385/
There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!