

by intended 961 days ago

That out of sample performance is a mirage.

Yes, it’s impressive. Yes, it’s got amazing zero-shot performance in some domains.

But there’s a pattern of failures in production that describes a limit, one that shouldn’t exist if the emergent properties were stable.

You can build this right now and test it.

Build a sequence of agents to work on a domain you are not an expert in.

Let them loose. See what happens.

Do the same thing on a domain you have expertise in.

Assume the number of errors you find and the number of modifications you have to make hold steady across other domains.
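
A rough sketch of how that last step could be tallied, assuming you have reviewed a batch of agent outputs in your own domain and marked each one as wrong or fine. The `reviewed` list and the `wilson_interval` helper are hypothetical, made up for illustration; the interval is the standard Wilson score interval for a proportion, used here only to show how wide the uncertainty is on a small sample.

```python
import math


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical review results from the domain you know well:
# True means you found an error in that agent output.
reviewed = [True, False, False, True, False, False, False, True, False, False]

errors = sum(reviewed)
low, high = wilson_interval(errors, len(reviewed))

print(f"observed error rate: {errors / len(reviewed):.2f}")
print(f"plausible range: {low:.2f} to {high:.2f}")
```

The range is only as good as the assumption that your expert domain is representative of the one you cannot check, which is exactly the assumption under discussion.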

1 comments

ethbr1 961 days ago

I'd phrase it this way: characterizing the reliability of out-of-sample performance a priori is impossible, but that doesn't mean out-of-sample performance automatically fails.

There may be a subtle correlation between properties needed to answer a specific out-of-sample request and in-sample features.

Unfortunately, prior to training/testing and without recognizing that correlation in the data set, I believe it's impossible to guarantee the model will include it. (Corrections welcome)


intended 961 days ago

In essence: “You can’t know in advance how well the model can approximate semantic patterns.”

So claiming that out-of-sample performance is a mirage would be a bridge too far?


ethbr1 961 days ago

Maybe "a mirage that might actually be true"? Which is a terrible thing to rely on! Unless it's usually true?


intended 961 days ago

That measurement is the core of my current tasks. If you don’t know the error rate, then what are you doing?


ethbr1 961 days ago

Delivering what some executive promised when they told investors 'the company is using AI.' /s


intended 961 days ago

A virtual beer/poison of choice to you and mjburgess in this thread.
