| > it is now quite well established that GPT-4 has impressive out-of-sample performance Err... I can show this is false, kinda trivially. People who engage in prompt-confirmation-bias aren't aware of what the in-sample is. It's basically everything ever digitised: you can ask it for anything from the first paragraph of every Dickens novel to the average petal length of an iris flower, etc. How are you measuring the in-sample here? If you engage in straightforward reasoning from first principles, and are basically aware of what the training data is, you can show in 10 seconds critical failures of generalisation. If you want a recipe: go find some fringe API docs. Establish that it has been trained on them. Then, since they're fringe, there won't be much code on GitHub, etc. Now ask it to do something non-trivial with that API. It will fail, and the mechanism will be obvious: it'll jam in correlated code that lacks relevance. Do the same on a popular API, and see it succeed. The in-sample will be obvious for both, and the boundary of generalisation |
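For anyone who wants to try the recipe rather than argue about it, here is a rough sketch of how the popular-vs-fringe comparison could be scored. Everything in it is an assumption for illustration: `Task`, `pass_rate`, `generalisation_gap`, the `stub_model` stand-in and the made-up library name `obscurelib` are hypothetical, and a real run would need a real model call and much better checkers than a substring match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One prompt plus a checker deciding whether the generated code is acceptable."""
    prompt: str
    check: Callable[[str], bool]


def pass_rate(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose generated code passes its own checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(generate(task.prompt)))
    return passed / len(tasks)


def generalisation_gap(
    generate: Callable[[str], str],
    popular: List[Task],
    fringe: List[Task],
) -> float:
    """Popular-API pass rate minus fringe-API pass rate."""
    return pass_rate(generate, popular) - pass_rate(generate, fringe)


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption: swap in your own client).
    # It mimics the claimed failure mode by always emitting the popular idiom.
    return "requests.get(url)"


# Toy tasks; "obscurelib" is a hypothetical fringe library, not a real package.
popular_tasks = [Task("fetch a URL with requests", lambda code: "requests.get" in code)]
fringe_tasks = [Task("fetch a URL with obscurelib", lambda code: "obscurelib.fetch" in code)]

gap = generalisation_gap(stub_model, popular_tasks, fringe_tasks)
print(gap)
```

A large gap on tasks of matched difficulty would support the claim; a gap near zero would count against it. The stub only demonstrates the bookkeeping, it says nothing about any actual model.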
I am sure you will continue to argue that this is still in line with everything-that's-ever-written prediction, but my opinion is that, at that point, it's a meaningless distinction. The human brain is also just a machine.