Hacker News
by rachofsunshine 749 days ago
Yeah, as someone with an education background I suspect GPT-4 is relatively close to the general public's performance on this problem. Many people would miss AIW, and almost all would miss AIW+. I'm about as good at this kind of thing as anyone and I'd need a minute with pencil and paper to handle AIW+; it's on par with the most difficult problems found on tests like the GRE.

I wonder if these models, trained on data from across the internet, are in some ethereal way capturing the cognitive approaches of the average person (and not picking the best approaches). If the average person does not think in these sorts of symbolic-manipulative terms, and therefore does not write in those terms, and you train a model on that writing...?

1 comment

I wonder the same thing. If any academic reading this wants a paper idea:

1. Examine papers and other claims that an LLM gets something wrong that a human would have gotten right. How many of those claims cite any data on how many humans actually get it wrong? And how many of those citations use the general population, rather than a population that is unusually well-suited to answering the question correctly (e.g. people who signed up for the GRE are more likely to get GRE questions right than the general population)?

2. For claims that are totally missing citations on human performance, run some tests with humans from the general population (or as close as you can get), and see how the LLMs compare.
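For the comparison in step 2, one plausible approach (my assumption, nothing more) is a two-proportion z-test on the accuracy rates of the two groups. Here's a rough sketch; the function name and arguments are hypothetical, and it assumes every answer is scored simply right or wrong and that both samples are large enough for the normal approximation to hold:

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two accuracy rates.

    Hypothetical helper: group A could be human participants and group B
    LLM attempts, each answer scored as simply right or wrong.
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both samples must be non-empty")

    p_a = correct_a / n_a
    p_b = correct_b / n_b

    # Pooled accuracy under the null hypothesis that both groups perform equally.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Everyone got everything right (or wrong): no difference to test.
        return 0.0, 1.0

    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

You'd call it with the number of correct answers and the sample size for each group. A small p-value suggests the two accuracy rates really differ; a large one means the data can't distinguish the LLM from the human sample. With small samples, an exact test would probably be a safer choice.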