by PaulHoule
218 days ago
I'll note that they already had a large annotated data set that they were using to train and evaluate their own models. Once they decided to start testing LLMs, it was straightforward for them to say "LLM 1 outperforms LLM 2" or "Prompt 3 outperforms Prompt 4". I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now and is the reason why an overwhelming fraction of AI projects fail. If you can get good enough performance out of zero-shot then yeah, zero-shot is fine. The thing is, to know it is good enough you still have to collect and annotate more data than most people and organizations want to.
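To make the "how do you know it is good enough" point concrete, here is a rough sketch of why the size of the annotated set matters. It is not from the article; the function name `wilson_interval` and the example counts are made up for illustration. It computes a Wilson score interval around an accuracy measured on a labeled eval set.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts: the same 90% observed accuracy on two eval-set sizes.
small = wilson_interval(45, 50)
large = wilson_interval(4500, 5000)

print(f"n=50:   {small[0]:.3f} to {small[1]:.3f}")
print(f"n=5000: {large[0]:.3f} to {large[1]:.3f}")
```

With 50 annotated examples, an observed 90% is compatible with a true accuracy anywhere from roughly 0.79 to 0.96, which is too wide to say whether Prompt 3 really beats Prompt 4. With 5000 examples the interval shrinks to under two points wide.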
This has been the bottleneck in every ML (not just text/LLM) project I’ve been part of.
It's not finding the right AI engineers. It's not getting the MLOps textbook-perfect using the latest trends.
It’s collecting enough high-quality data and getting it properly annotated and verified, then doing proper evals with humans in the loop to get it right.
People who only know these projects through headlines and podcasts really don’t like to accept this idea. Everyone wants synthetic data with LLMs doing the annotations and evals because they’ve been sold the idea that the AI will do everything for you if you just use it right. Layer on top of that the idea that LLMs can also write the code for you, and it’s a mess when you have to deal with people who get their AI knowledge only from headlines, LinkedIn posts, and podcasts.