
by mnkv 259 days ago

> the generation of 281,128 augmented examples, from which 1,000 were held out as a benchmark test set.

This model is trained on a custom dataset of 280k examples then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.

This is a reasonable hobby project and interesting approach to synthetic data generation but not impressive research.

At minimum you should test your model on other benchmarks that have similar tasks, e.g. docbench.

4 comments

gundmc 259 days ago

It's not novel research, but I think it drives home the point that many narrow applications of AI do not require the largest, latest (and most expensive) models. And in many of those cases, a small fine-tuned model is the most performant and cost-effective.

It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.

Garlef 259 days ago

It's a matter of ROI: When is it worth it to build something specialized?

sigbottle 259 days ago

Well, one day it might be at the level of shell scripting. I don't think about "the tradeoffs of building a specialized shell script", I just do it because it's cheap and easy and solves a problem right then and there.

I don't know how you would even begin to make this same kind of observation for ML models, but it seems possible. The 2010s weren't exactly building out "trivial" models, but compared to the architectures and optimizations out now, yeah, those models are toys by comparison.

ImJasonH 259 days ago

Is anybody working on making building specialized things easier and cheaper?

-_- 259 days ago

Yes! At https://RunRL.com we offer hosted RL fine-tuning, so all you need to provide is a dataset and reward function or environment.

selim-now 258 days ago

yes! check out https://distillabs.ai/ – follows a similar approach except the evaluation set is held out before the synthetic data generation, which I would argue makes it more robust (I'm affiliated)

bangaladore 259 days ago

> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.

My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.

The real issue is they tested on data in their training set. *

* Incorrect. Edit: I misread the parent comment.

littlestymaar 259 days ago

> The real issue is they tested on data in their training set.

Hm, no.

They trained on a part of their synthetic set and tested on another part of the set. Or at least that's what they said they did:

> from which 1,000 were held out as a benchmark test set.

Emphasis mine.

_carltg 259 days ago

Yes, but due to it being derived from the same underlying source dataset, it is effectively evaluating on the training dataset, not an independent validation/ test dataset.

The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.

bangaladore 259 days ago

Thanks, rereading it makes it clear that you are correct.

disiplus 259 days ago

They did not test on the data that they trained on; that's not what he wrote.

DetroitThrow 259 days ago

They synthetically generated about 281k examples and kept 1k of them for testing.

It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.

The weights may not be published since using it for document extraction on even the same format but with slightly different content or lengths would show how abysmally this finetune does outside of the synthetic data.

bangaladore 259 days ago

Thanks, rereading it makes it clear that you are correct.

kingjimmy 259 days ago

In today's news, overfit models are overfit.

m3kw9 259 days ago

So they tested using training examples? Lmao

fxwin 259 days ago

> held out

Aperocky 259 days ago

Actually in this case that's not exactly true:

> generation of 281,128 augmented examples

All examples are already correlated because they are generated in the same way.

littlestymaar 259 days ago

> All examples are already correlated because they are generated in the same way.

All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…

The real question is whether or not the examples are representative of the broad “document information extraction” use-case.

_carltg 259 days ago

The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after. If you hold out after augmentation, then you leverage biases from the training regimen already and hence you artificially boost your model's performance. This is not sufficient to demonstrate your model is generalizing properly.

In analogy: instead of taking leaves off of different trees, they are taking leaves from different branches from the same tree.
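
A minimal sketch of that difference, in case it helps. Everything here is hypothetical: `augment` is a toy stand-in for whatever synthetic generation the paper used, and the function names, document IDs, and counts are made up for illustration. It only shows that splitting after augmentation lets the same source document land on both sides of the split, while splitting before augmentation cannot.

```python
import random

def augment(doc: str, n: int) -> list[str]:
    """Toy stand-in for augmentation: n variants derived from one source doc."""
    return [f"{doc}#v{i}" for i in range(n)]

def split_before_augmentation(sources, n_test_sources, n_aug, seed=0):
    """Hold out whole source documents first, then augment each side separately."""
    rng = random.Random(seed)
    shuffled = list(sources)
    rng.shuffle(shuffled)
    test_sources = shuffled[:n_test_sources]
    train_sources = shuffled[n_test_sources:]
    train = [(s, a) for s in train_sources for a in augment(s, n_aug)]
    test = [(s, a) for s in test_sources for a in augment(s, n_aug)]
    return train, test

def split_after_augmentation(sources, n_test_examples, n_aug, seed=0):
    """Augment everything first, then hold out random augmented examples."""
    rng = random.Random(seed)
    examples = [(s, a) for s in sources for a in augment(s, n_aug)]
    rng.shuffle(examples)
    return examples[n_test_examples:], examples[:n_test_examples]

def shared_sources(train, test):
    """Source documents that contributed examples to both train and test."""
    return {s for s, _ in train} & {s for s, _ in test}

if __name__ == "__main__":
    sources = [f"doc{i}" for i in range(25)]  # made-up source documents
    train_b, test_b = split_before_augmentation(sources, n_test_sources=2, n_aug=10)
    train_a, test_a = split_after_augmentation(sources, n_test_examples=25, n_aug=10)
    print("leaked sources, split before augmentation:", len(shared_sources(train_b, test_b)))
    print("leaked sources, split after augmentation:", len(shared_sources(train_a, test_a)))
```

The first split never shares a source between train and test (different trees). The second almost always does (different branches of the same tree), which is the kind of leakage that can inflate a held-out score.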

selim-now 258 days ago

That would definitely make the evaluation more robust. My fear is that with LLMs at hand, people have become allergic to preparing good human-labelled evaluation sets and will always, to some degree, use an LLM as a crutch.

fxwin 257 days ago

I would agree with that