| HN Mirror

by PaulHoule 218 days ago

I'll note that they had a large annotated data set already that they were using to train and evaluate their own models. Once they decided to start testing LLMs it was straightforward for them to say "LLM 1 outperforms LLM 2" or "Prompt 3 outperforms Prompt 4".
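To make that concrete, here is a minimal sketch of that kind of comparison. It assumes you already have human-annotated labels plus each prompt's outputs as plain lists; the labels and outputs below are made up for illustration, and `accuracy` is just a helper name I picked.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the human-annotated labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must align one-to-one")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical annotated set and the outputs of two prompts on it.
gold     = ["spam", "ham",  "ham", "spam", "ham", "spam", "ham", "ham"]
prompt_3 = ["spam", "ham",  "ham", "spam", "ham", "ham",  "ham", "ham"]
prompt_4 = ["spam", "spam", "ham", "ham",  "ham", "ham",  "ham", "spam"]

scores = {
    "prompt_3": accuracy(prompt_3, gold),
    "prompt_4": accuracy(prompt_4, gold),
}
best = max(scores, key=scores.get)  # name of the higher-scoring prompt
print(scores, "->", best)
```

Without the `gold` list there is nothing to score against, which is the whole point: the annotated data is what makes the comparison possible.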

I'm afraid that people will draw the wrong conclusion from "We didn’t just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now and is the reason why an overwhelming fraction of AI projects fail.

If you can get good enough performance out of zero shot then yeah, zero shot is fine. Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.
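A rough way to see why the annotated set has to be bigger than people want: the uncertainty on a measured accuracy shrinks only slowly with the number of labeled examples. The sketch below uses the standard Wilson score interval for a proportion; the 90% observed accuracy and the sample sizes are assumptions for illustration, not numbers from the article.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Same observed accuracy (90%), three hypothetical evaluation-set sizes.
for n in (50, 500, 5000):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n:>4}: interval [{low:.3f}, {high:.3f}], width {high - low:.3f}")
```

With 50 labeled examples the interval is more than ten percentage points wide, so "90% accurate" could be good enough or nowhere near it; it takes thousands of examples to narrow that to a couple of points.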

3 comments

Aurornis 218 days ago

> Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.

This has been the bottleneck in every ML (not just text/LLM) project I’ve been part of.

Not finding the right AI engineers. Not getting the MLops textbook perfect using the latest trends.

It’s collecting enough high-quality data and getting it properly annotated and verified. Then doing proper evals with humans in the loop to get it right.

People who only know these projects through headlines and podcasts really don’t like to accept this idea. Everyone wants synthetic data with LLMs doing the annotations and evals because they’ve been sold this idea that the AI will do everything for you, you just need to use it right. Then layer on top of that the idea that the LLMs can also write the code for you and it’s a mess when you have to deal with people who only gain their AI knowledge through headlines, LinkedIn posts, and podcasts.

isoprophlex 218 days ago

Amen brother. Working on a computer vision project right now, it's a wild success.

This isn't my first CV project, but it's the most successful one. And that's chiefly because my client pulled out their wallet and let an army of annotators create all the training data I asked for, and more.

spwa4 218 days ago

This has been the huge problem in AI research since at least 1998 (and that was just when I was first exposed to it). With data, everything is so much easier, and much simpler machine learning methods work.

Supervised learning. Took a while to make that work well.

And then every few years someone comes up with a way to distill data out of unsupervised examples. GPT is these days the big example of that, but there was "ImageNet (unlabeled)" and LAION before that too. The issue is that there is just so much unsupervised data.

Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and as this article points out, in any specific application they tend to get bested by something like XGBoost with very simple models)

The next frontier is probably "world models", where you first train unsupervised, not to train your model but to predict the world. THEN you train the model in this simulated, predicted world. That's the reason Yann LeCun really really wants to go down this direction.

jacquesm 217 days ago

> Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and as this article points out, in any specific application they tend to get bested by something like XGBoost with very simple models)

You can't blame the users for that, though. For instance, OpenAI's ChatGPT uses 'Ask Anything' as its home page prompt. Zero specialization, expert at anything. And people totally believe it.

PaulHoule 218 days ago

I’ve got no problem w/ synthetic data, but it is still more work than most people want to do.

richardlblair 218 days ago

There was a post on here recently about how you should build your own agent, and I completely agree. I'd say most competent developers should be building even more complex projects than an agent. Once you do you quickly realize how it's a constant uphill battle, and it quickly becomes apparent that the data you're working with is the primary issue.

beepbooptheory 218 days ago

I don't know if that is what gp and above are talking about. "Agents" are the kind of thing/word that helps to paper over the very fact that these things only work because of a huge number of humans in the loop at the outset (that is, you know, labor). Agents help us believe that LLMs can do everything for us, even bootstrap themselves, but what the above thread is about is that, really, what you get out correlates only to what you put in in the first place.

zahlman 217 days ago

> Agents help us believe that LLMs can do everything for us, even bootstrap themselves

Having the agent, and treating it carelessly, helps one believe this.

Making it is another story.

scrame 217 days ago

> an overwhelming fraction of AI projects fail.

An overwhelming number of software projects fail, AI just helps them get there faster.

ghm2180 218 days ago

I would offer a stronger, more pointed observation: often the problem in building a good classifier is having good negative examples. More generally, how well a classifier identifies good negatives is a function of:

1. Data collection technique.

2. Data annotation (labelling).

3. Whether the classifier can learn from your "good" negatives. Quantitatively this depends on the model's residuals or on margin/contrastive/triplet losses, i.e. on learning the difference between a negative and a positive at train time, where the optimization minimum is higher than at test time.

4. Calibration/Reranking and other Post Processing.

My guess is that they hit a sweet spot with the first 3 techniques.

jacquesm 217 days ago

I think the biggest problem with such classifiers is actually knowing what is good data and what is bad data: taking a sample of the data and recognizing whether or not this dataset is a general enough representation of both true and false examples (for a binary classifier) to be usable for training a model. It isn't rare at all to have data sets that are biased 100 to 1 or more toward one of the classes, that contain hints about what class the object is in that aren't in the object itself, and so on. You can train until the cows come home on such data but it will never lead to satisfactory results.
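A cheap first sanity check along these lines, as a sketch with made-up labels: count the classes and see what a "classifier" that always predicts the majority class would score. `class_balance_report` is a hypothetical helper, and it says nothing about the leaked-hint problem, which needs someone actually looking at the data.

```python
from collections import Counter

def class_balance_report(labels):
    """Summarize class counts and the accuracy of always guessing the majority class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    _, majority_count = counts.most_common(1)[0]
    minority_count = min(counts.values())
    return {
        "counts": dict(counts),
        "imbalance_ratio": majority_count / minority_count,
        "majority_baseline_accuracy": majority_count / len(labels),
    }

# Hypothetical data set skewed 100 to 1.
labels = ["negative"] * 1000 + ["positive"] * 10
report = class_balance_report(labels)
print(report)
```

On a 100-to-1 set the do-nothing baseline already scores about 99% accuracy, so a model reporting 99% has not necessarily learned anything about the rare class.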

ghm2180 211 days ago

The bias is an issue that can be handled in a variety of ways; one that I know works is to put weights on your rarer class when training. You could also use larger margins to make sure you definitely don't misclassify the rare class, at the cost of mislabelling your dominant class, presuming you are OK with that. An example is when doctors order breast biopsies: they are ordered far more often than the cancer itself occurs, and on the basis of a noisy technique, the physical exam.
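A minimal sketch of the class-weighting idea, in plain Python so nothing depends on a particular library. The weights use the common "balanced" heuristic, n_samples / (n_classes * class_count), which as far as I know is also what scikit-learn's `class_weight="balanced"` computes. The label counts, the constant 1% prediction, and both helper names are assumptions for illustration.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def weighted_log_loss(y_true, p_pos, weights, positive="pos", eps=1e-12):
    """Binary log loss where each example counts in proportion to its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        w = weights[y]
        total += -w * math.log(p if y == positive else 1 - p)
        weight_sum += w
    return total / weight_sum

# Hypothetical 99-to-1 data set and a lazy model that predicts 1% "pos" for everything.
y_true = ["neg"] * 990 + ["pos"] * 10
p_pos = [0.01] * len(y_true)

weights = balanced_class_weights(y_true)
plain = weighted_log_loss(y_true, p_pos, {"neg": 1.0, "pos": 1.0})
balanced = weighted_log_loss(y_true, p_pos, weights)
print(weights, plain, balanced)
```

With equal weights the lazy model looks fine because the 990 negatives dominate the loss; with balanced weights the 10 positives carry as much total weight as the negatives, so ignoring them becomes expensive during training.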
