HN Mirror

by code51 1103 days ago

This "GPT4 evaluating LLMs" problem is not limited to this case. I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Couple this with the reliance on crowd-sourcing to create evaluation datasets and the heavy use of GPT-3.5 and GPT-4 by MTurk workers, and you have a big fat feed-forward process benefiting only one party: OpenAI.

The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out. Reddit, Twitter and the like are awakening just now - to find that they're basically powerless against this wave of distorted future standards.

Once GPT is sufficiently proven to pass every existing test on Earth, every institution will be so reliant on producing work with GPT that we won't have a "100% handmade exam" anymore. There will be no problem left to test GPT with.

3 comments

YeGoblynQueenne 1103 days ago

>> I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it.

Let me not bitch again about the complete lack of anything like objective measures of success in language modelling, in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try and find good metrics, build good benchmarks. It's much easier to put a paper on arXiv that shows SOTA results (0.01 more than the best system compared to!). And so the house of cards rises ever towards the sky.

Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!


mensetmanusman 1102 days ago

I would have said machine learning is more like materials science, but you are on the right track.

As you increase the number of bits you are trying to comprehend, you move from quantum physics to chemistry to materials science to biology to social science.

At certain points, the methods and reproducibility become somewhat of a dark art. I have experienced that in my field of materials science.

Because these models are using billions or trillions of random number generators in their probability chains, it starts looking more like the harder hard sciences: it gets very difficult to track and understand what is important.

I think machine learning will be easier to comprehend than social sciences, so I wouldn't put it that high. It will be something between materials science and biology levels of difficulty in understanding.


emme 1103 days ago

Yes, I have similar concerns. These models regurgitate previously seen strings, previous benchmarks included. When you try to evaluate their sheer ability to reason on the text, however, they perform poorly. (Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)


wizzwizz4 1103 days ago

> The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out.

If OpenAI ceased to be – probably for some legislative reason –, would the problems go away?


ChatGTP 1103 days ago

The damage will have been done, so I don’t think so.
