by craffel 2309 days ago
Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!
4 comments

I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.
Does that mean that answer grading would become like comparing summaries of a given text?
I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?
It's slightly more than that -- it also involves lowercasing and removing articles before testing for string equality.
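For anyone curious what that looks like in practice, here is a minimal sketch of the check as described above: lowercase, drop articles, then compare strings. The names `normalize` and `exact_match` are made up for illustration, and the whitespace collapsing is my own assumption so that removing an article doesn't leave stray spaces behind. The real evaluation code may do additional normalization that isn't shown here.

```python
import re

# English articles as whole words; the input is lowercased before matching.
_ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize(text: str) -> str:
    """Lowercase, strip articles, and collapse whitespace (the last step is an assumption)."""
    text = text.lower()
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(prediction: str, target: str) -> bool:
    """True only if the two strings are identical after normalization."""
    return normalize(prediction) == normalize(target)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
    print(exact_match("Paris, France", "Paris"))            # False
```

The second call shows the brittleness: an answer that is arguably right but phrased with extra detail still scores zero.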
Why are you replying to every single comment?
I think craffel (probably "Colin Raffel, Senior Research Scientist, Google Research") was directly involved in this research!
Yes, that's me! Sorry if I'm being overeager; I like talking about my research!
I think it's amazing how frequently people involved in various CS and IT things are directly participating in threads about their work here on HN.