HN Mirror



	by larkinnaire 749 days ago
	The idea that these word problems (and other LLM stumpers) are "easily solvable by humans" needs some empirical data behind it. Computer people like puzzles, and this kind of thing seems straightforward to them. I think the percentage of the general population who would get these puzzles right with the same time constraints LLMs are subjected to is much lower than the authors would expect, and that the LLMs are right in line with human-level reasoning in this case. (Of course, I don't have a citation either, but I'm not the one writing the paper.)


rachofsunshine 749 days ago

Yeah, as someone with an education background I suspect GPT-4 is relatively close to the general public's performance on this problem. Many people would miss AIW, and almost all would miss AIW+. I'm about as good at this kind of thing as anyone and I'd need a minute with pencil and paper to handle AIW+; it's on par with the most difficult problems found on tests like the GRE.

I wonder if these models, trained on data from across the internet, are in some ethereal way capturing the cognitive approaches of the average person (and not picking the best approaches). If the average person does not think in these sorts of symbolic-manipulative terms, and therefore does not write in those terms, and you train a model on that writing...?

larkinnaire 748 days ago

I wonder the same thing. If any academic reading this wants a paper idea:

1. Examine papers and other claims that an LLM gets something wrong that a human would have gotten right. How many of those claims cite any data on how many humans actually get it wrong? How many of those citations use the general population instead of a population that is unusually well suited to answering the question correctly (e.g., people who signed up for the GRE are more likely to get GRE questions right than the general population)?

2. For claims that are totally missing citations on human performance, run some tests with humans from the general population (or as close as you can get), and see how the LLMs compare.