
	by craffel 2309 days ago
	Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!

4 comments

modeless 2309 days ago

I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.

lsb 2309 days ago

Does that mean that answer grading would become like comparing summaries of a given text?

dmit 2309 days ago

I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?

craffel 2309 days ago

It's slightly more than that -- it also involves lowercasing and removing articles before testing for string equality.
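
For readers who want to see what that looks like, here is a minimal sketch of the check as described above: lowercase, drop articles, then compare strings. The function names `normalize_answer` and `exact_match` are made up for illustration, and collapsing whitespace is an added assumption so that a removed article does not leave a double space behind. This is not the actual evaluation code used in the research.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, drop English articles, and collapse whitespace (assumed step)."""
    text = text.lower()
    # Remove the articles "a", "an", and "the" when they appear as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace left behind by the removal.
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


print(exact_match("The Eiffel Tower", "eiffel tower"))     # True
print(exact_match("Eiffel Tower, Paris", "eiffel tower"))  # False
```

The second call shows the brittleness: an answer that contains the reference plus extra detail still counts as wrong.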

svnpenn 2309 days ago

Why are you replying to every single comment?

schoen 2309 days ago

I think craffel (probably "Colin Raffel, Senior Research Scientist, Google Research") was directly involved in this research!

craffel 2309 days ago

Yes, that's me! Sorry if I'm being overeager, I like talking about my research!

schoen 2309 days ago

I think it's amazing how frequently people involved in various CS and IT things are directly participating in threads about their work here on HN.
