Hacker News
by powera 1698 days ago
Scoring 55% on a test like this should not be considered a great accomplishment. A sign of progress, yes, but not an accomplishment by itself.

This is still simply a system that is good at guessing. It does not know anything.


> It does not know anything.

I would argue that it "knows" an awful lot, but it can't actually reason with it.

However impressive GPT-3-type models are, I am not particularly convinced that they're much more than glorified hash tables.

If the hash table is large enough, it can produce a lot of answers to a lot of questions, or approximately imitate a lot of stuff it's seen before.

Whether it can actually combine "knowledge" it has stored in its weights into a pattern it's never seen before ... I'm not convinced.

Re: "Glorified Hashtable"

There is a one-to-one correspondence between data compression and generative models. GPT-2 is a highly effective lossless data compression tool: https://bellard.org/textsynth/sms.html

Always wondered why this insight is not taught as much, especially in the context of things like dimensionality reduction...
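
To make the compression/model link concrete, here is a toy sketch (not how the linked GPT-2 compressor is implemented): a model that assigns probability p to a symbol implies an ideal code length of -log2(p) bits for it, so a better predictive model means a shorter encoding. The unigram character model and the sample string are assumptions chosen purely for illustration, and the model is fitted on the very text it scores, which a real compressor could not do for free.

```python
import math
from collections import Counter


def ideal_code_length_bits(text, probs):
    """Total Shannon code length: sum of -log2 p(symbol) over the text."""
    return sum(-math.log2(probs[ch]) for ch in text)


def unigram_model(text):
    """Character frequencies of the text, used as a (very weak) generative model."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}


sample = "the cat sat on the mat and the cat sat on the hat"
alphabet = set(sample)

# Baseline: a model that knows nothing and treats every symbol as equally likely.
uniform = {ch: 1 / len(alphabet) for ch in alphabet}
# Slightly better model: actual character frequencies.
fitted = unigram_model(sample)

uniform_bits = ideal_code_length_bits(sample, uniform)
fitted_bits = ideal_code_length_bits(sample, fitted)

print(f"uniform model: {uniform_bits:.1f} bits")
print(f"unigram model: {fitted_bits:.1f} bits")
```

An arithmetic coder driven by the same probabilities can get within a few bits of these totals, which is why swapping in a stronger predictor directly yields a stronger compressor.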

The Hutter prize for improved compression algorithms is explicitly about the relationship between compression and intelligence. http://prize.hutter1.net/

GEOS (2015) scores 49% on SAT geometry problems: https://www.semanticscholar.org/paper/Solving-Geometry-Probl...

You were good at guessing!

SAT problems are multiple choice, with 5 options. So 50% is barely twice random guessing (1/5).

See how far randomly guessing an integer 1-1000 gets you with OP's word problems with freeform responses.
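
For a rough sense of scale, the arithmetic behind both baselines can be sketched as follows. It uses only numbers from this thread (5 options, a 50% score, the 55% score from the top comment, and a uniform guess over 1-1000); treating free-form answers as a uniform pick from 1000 integers is the hypothetical above, not a property of the actual benchmark, and `lift_over_chance` is a made-up helper name.

```python
from fractions import Fraction


def lift_over_chance(accuracy, num_equally_likely_answers):
    """How many times better than uniform random guessing a given accuracy is."""
    chance = Fraction(1, num_equally_likely_answers)
    return accuracy / chance


# Multiple choice with 5 options: chance is 1/5 = 20%.
sat_lift = lift_over_chance(Fraction(50, 100), 5)

# Free-form answer, modelled as a uniform guess over the integers 1..1000.
free_lift = lift_over_chance(Fraction(55, 100), 1000)

print(f"50% on 5-option multiple choice: {float(sat_lift):.1f}x chance")
print(f"55% on 1-in-1000 free response: {float(free_lift):.0f}x chance")
```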

I think the actual guessing space for these free response problems is much smaller, through simple priors over the question. For example:

“Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?”

A rudimentary model will likely already know the answer is between 0 and 60.

Knowing that the answer involves addition and subtraction narrows it down to maybe 8 answers.
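
A quick sketch of that narrowing, under one specific assumption about what "involves addition and subtraction" means: each number in the problem (60, 30, 10) is added, subtracted, or left out at most once, and only results in the 0 to 60 range are kept. That rule is an illustrative choice, not something stated in the problem or the paper.

```python
from itertools import product

# Work the problem directly.
total, robert, gap = 60, 30, 10
richard = robert - gap            # Robert has 10 more than Richard
jerry = total - robert - richard  # the three share 60 cherries
answer = robert - jerry           # how many more Robert has than Jerry

# Candidate answers a guesser could reach: every +/- combination of the
# numbers in the question (each used at most once), kept if within 0..60.
numbers = (60, 30, 10)
reachable = {
    sum(sign * n for sign, n in zip(signs, numbers))
    for signs in product((-1, 0, 1), repeat=len(numbers))
}
candidates = sorted(reachable & set(range(0, 61)))

print("answer:", answer)
print("candidates:", candidates)
```

Under this rule there are 7 candidates (the multiples of 10 from 0 to 60), so a uniform guess among them is right about 14% of the time, far better than a guess over 1-1000.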

While SAT problems have only 4 answers, there’s usually one trick/trap answer, which I think might be difficult for a model to avoid guessing by accident. The analogy I can think of is that it’s sometimes better to cover up the answers first and work out a solution, so as not to get biased by any particular answer choice.

"barely twice random guessing" is the median score for high school students who take the SAT.