| That out-of-sample performance is a mirage. Yes, it’s impressive. Yes, it’s got amazing zero-shot performance in domains. But there’s a pattern of failure in production that describes a limit, one that shouldn’t exist if the emergent properties were stable. You can build this right now and test it. Build a sequence of agents to work on a domain you are not an expert in. Let them loose. See what happens. Do the same thing on a domain you have expertise in. Assume the number of errors you find and the number of modifications you have to make are stable for other domains. |
There may be a subtle correlation between properties needed to answer a specific out-of-sample request and in-sample features.
Unfortunately, unless that correlation is recognized in the data set before training and testing, I believe it's impossible to guarantee the model will capture it. (Corrections welcome)