| HN Mirror

by bangaladore 259 days ago

> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.

My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.

The real issue is they tested on data in their training set. *

* Incorrect. Edit: I misread the parent comment.

2 comments

littlestymaar 259 days ago

> The real issue is they tested on data in their training set.

Hm, no.

They trained on a part of their synthetic set and tested on another part of the set. Or at least that's what they said they did:

> from which 1,000 were held out as a benchmark test set.

Emphasis mine.

_carltg 259 days ago

Yes, but because the test set is derived from the same underlying source dataset, this is effectively evaluating on the training dataset, not on an independent validation/test dataset.

The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.
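To illustrate the distinction, here is a minimal sketch of a split that keeps every example derived from the same source on one side only. It assumes each synthetic example carries a `source_id` field naming the document it was generated from; that field and the function name `split_by_source` are hypothetical and not taken from the article.

```python
import random
from collections import defaultdict


def split_by_source(examples, test_fraction=0.1, seed=0):
    """Split examples so that no source_id appears in both train and test.

    `examples` is a list of dicts with a hypothetical "source_id" key
    identifying the underlying document each example was derived from.
    """
    # Group examples by the source they were generated from.
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["source_id"]].append(ex)

    # Shuffle the sources (not the individual examples) reproducibly.
    source_ids = sorted(groups)
    random.Random(seed).shuffle(source_ids)

    # Hold out whole sources for testing.
    n_test = max(1, int(len(source_ids) * test_fraction))
    test = [ex for sid in source_ids[:n_test] for ex in groups[sid]]
    train = [ex for sid in source_ids[n_test:] for ex in groups[sid]]
    return train, test
```

A plain random split over individual examples would let variants of the same source land on both sides. Splitting by source is closer to an independent test, although it still does not cover data from a different generator.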

bangaladore 259 days ago

Thanks, rereading it makes it clear that you are correct.

disiplus 259 days ago

They did not test on the data that they trained on; that's not what he wrote.

DetroitThrow 259 days ago

They synthetically generated 290k examples and kept 10k of them for testing.

It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.

The weights may not be published because using the model for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this finetune performs outside of the synthetic data.
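One rough way to check how similar held-out examples are to the training examples is to compare word n-gram overlap. The sketch below flags test items whose closest training item exceeds a Jaccard similarity threshold. The function names, the 3-gram size, and the 0.8 threshold are illustrative assumptions, not anything the article describes.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leakage_report(train_texts, test_texts, threshold=0.8, n=3):
    """List (test_index, best_similarity) for test items near a train item."""
    train_sets = [shingles(t, n) for t in train_texts]
    flagged = []
    for i, text in enumerate(test_texts):
        s = shingles(text, n)
        # Similarity to the closest training example.
        best = max((jaccard(s, ts) for ts in train_sets), default=0.0)
        if best >= threshold:
            flagged.append((i, best))
    return flagged
```

This compares every test item against every training item, so on a dataset of the size discussed here it would need sampling or an approximate method such as MinHash. A high share of flagged items would suggest the held-out score says little about performance on new documents.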
