by 51109 3830 days ago
Thanks for the link to the paper. I think the older criticisms still stand:

- They created their own data set. Instead of building a general Q-A system, they may have overfit to this particular task and its question types.

- They set the human intelligence benchmark by using Mechanical Turk. This may not be representative of true human intelligence (given the lower quality of responses from Mechanical Turkers).

As future work, I hope they look at what the Allen Institute for AI is doing with Aristo [1].

There is a current challenge [2] to beat 8th graders on a standardized 8th grade science exam. Here the data set is made by someone else, the questions are closer to real-life questions (vs. simple analogies), progress and results can be compared with those of other research teams, and the human benchmark is set by the actual performance of 8th graders.

For human intelligence, next to accuracy and speed, we also care about simplicity. This system nails accuracy and speed, but it may be beaten by someone who has never read the entirety of Wikipedia. A deep net trained on millions of words and their relations may be too complex for this task (it uses up a lot of energy to train).

As to computer intelligence being different from human intelligence: I once nearly aced an aptitude test where I had access to a search engine. The test involved programming languages in which I had never written a single line. Yet by searching for keywords from the question combined with keywords from each answer, I could give correct answers merely by comparing page count statistics. Like the person in Searle's Chinese room, I was merely pattern matching, without any real understanding of or insight into the questions asked. The result of my test leaked out on the work floor, and for weeks I was a headline wonder (having beaten all the senior engineers' scores), without really deserving it.
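
A minimal sketch of that page-count trick, in Python. Everything here is an assumption for illustration: `pick_answer` and `fake_hit_count` are hypothetical names, and the counts are invented stand-ins for whatever a real search engine would report. It only shows the idea of scoring each candidate answer by how often it co-occurs with the question's keywords.

```python
from typing import Callable, Sequence


def pick_answer(
    question_keywords: str,
    answers: Sequence[str],
    hit_count: Callable[[str], int],
) -> str:
    """Return the candidate answer whose combined query has the most hits.

    hit_count is any function mapping a query string to a page count.
    No understanding of the question is involved, only counting.
    """
    return max(answers, key=lambda a: hit_count(f"{question_keywords} {a}"))


# Invented page counts standing in for a real search engine (assumption).
FAKE_COUNTS = {
    "erlang start process spawn": 12000,
    "erlang start process fork": 800,
    "erlang start process new Thread": 150,
}


def fake_hit_count(query: str) -> int:
    # Unknown queries count as zero hits.
    return FAKE_COUNTS.get(query, 0)


best = pick_answer(
    "erlang start process",
    ["fork", "spawn", "new Thread"],
    fake_hit_count,
)
print(best)
```

With a real search backend swapped in for `fake_hit_count`, the selection logic would stay the same, which is the point: the "test taker" never needs to know the language being asked about.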

[1] http://allenai.org/aristo.html
[2] https://www.kaggle.com/c/the-allen-ai-science-challenge

2 comments

To add to your second point, Andrew Gelman had some blog posts earlier this year detailing the challenges of doing online surveys. He included simple questions that respondents "should" have been able to answer and found that a fair percentage (>10%) got some of them wrong. I am assuming there is no incentive for answering more questions correctly, so it's possible that some respondents answered blindly to finish quickly, leading to lower scores.
These general QA datasets wouldn't work, because their system can only do word analogies and some similar tasks, like antonyms. It can't understand a sentence or anything.