| >> I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT-4. GPT-4 at this point is being regarded as "ground-truth" with each passing day. Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it. Let me not bitch again about the complete lack of anything like objective measures of success in language modelling, in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try and find good metrics, build good benchmarks. It's much easier to put a paper on arxiv that shows SOTA results (0.01 more than the best system compared to!). And so the house of cards rises ever towards the sky. Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking: What Will it Take to Fix Benchmarking in Natural Language Understanding? https://aclanthology.org/2021.naacl-main.385/ There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here! |
As you increase the number of bits you are trying to comprehend, you move from quantum physics to chemistry to material science to biology to social science.
At certain points, the methods and reproducibility become somewhat of a dark art. I have experienced that in my field of materials science.
Because these models are using billions or trillions of random number generators in their probability chains, it starts looking more like the harder hard sciences: it gets very difficult to track and understand what is important.
I think machine learning will be easier to comprehend than social sciences, so I wouldn't put it that high. It will be something between materials science and biology levels of difficulty in understanding.