First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set, or any single problem from it. The problem here is that the ARC challenge deliberately gives a training set with a different distribution than both the public and the private test sets. It's like having only 1+1=2, 3+5=8, 9+9=18 in the training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in the test set.

I can see the argument of "give the easy problems as a demonstration of the rules, and then with 'intelligence' [1] you should be able to get the harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. For example, I can teach my kids how factorial works and that ! means factorial, instead of teaching them only how addition works and making them figure out how multiplication, division and factorial work and what the notation is.

[1] Whatever that means.
Only having [1+1=2, 4+5=9, 2+10=12] in the training set and [2*5=10, 3/4=.75, 2^8=256] in the test set would be bad, but something like [1+1=2, 3+4*2=11, 5*3=15, 2*7=14, 1+3/5=1.6, 3^3=27] vs [2+4*3=14, 3+3^2+4=16, 2*3/4+2^3/2^4=2] might not be, depending on what they're trying to test.
Compositionality of information, especially of abstractions (like rules or models of a phenomenon), is a key criterion in a lot of people's attempts to operationally define "intelligence". (I agree that's overall a nebulous and overloaded concept, but if we're going to make claims about it we need at least a working definition for any particular test we're doing.) I could see that meaning that the test set problems need to be "harder", in the sense that presenting compositions of rules in training doesn't preclude memorizing the combinations. But this is just a guess; I'm not involved in ARC and don't know, obviously.
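To make the difference between the two arithmetic splits concrete, here's a minimal Python sketch. The helper names (`operators_used`, `unseen_operators`) are made up for illustration and have nothing to do with how ARC is actually built or scored. It only checks which operator symbols the test set needs that the training set never shows.

```python
# Operator symbols to look for; "!" is included to cover the factorial example.
OPERATORS = set("+-*/^!")


def operators_used(problems):
    """Return the set of operator symbols on the left-hand side of each problem."""
    used = set()
    for problem in problems:
        lhs = problem.split("=")[0]
        used |= {ch for ch in lhs if ch in OPERATORS}
    return used


def unseen_operators(train, test):
    """Operators the test set requires that never appear in the training set."""
    return operators_used(test) - operators_used(train)


# Split where the test set relies on operators never shown in training.
disjoint_train = ["1+1=2", "4+5=9", "2+10=12"]
disjoint_test = ["2*5=10", "3/4=.75", "2^8=256"]

# Split where every operator is shown in training, but the test set
# combines them in ways that were not shown.
compositional_train = ["1+1=2", "3+4*2=11", "5*3=15", "2*7=14", "1+3/5=1.6", "3^3=27"]
compositional_test = ["2+4*3=14", "3+3^2+4=16", "2*3/4+2^3/2^4=2"]

print(sorted(unseen_operators(disjoint_train, disjoint_test)))            # ['*', '/', '^']
print(sorted(unseen_operators(compositional_train, compositional_test)))  # []
```

The first split needs rules that were never demonstrated; the second only needs new combinations of demonstrated rules. A check this crude says nothing about difficulty, just about whether the primitives were shown at all.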