| HN Mirror

I don't completely agree - I think it's not about GPT-3 failing to generalize word puzzle solutions it's about the type of minimized solution that gradient descent algorithms find which will produce overwhelmingly correct outputs but may lack a useful internal organization of the semantics of the training set which may or may not translate into poor model performance on out-of-sample inputs.

It's hard to say that there is no internal organization since trillion parameter models are hard for us to summarize and we do see some semantic vector alignment in the GPT models but the toy example of the 2 skull image generator present a powerful anecdote of how current ML models find correct solutions but miss a potentially valuable property of having what the paper calls factored representation which seems to be the far more "human" way to reason about data.