by stakhanov
2719 days ago
I oversimplified in my previous post. You can usually get a result that's slightly better than random agreement by applying some really-not-all-that-impressive baseline strategy, like looking for bag-of-words overlap between a question and a known factoid for which you have a stored answer.

This strategy is easy to fool: if you have a factoid like "Dogs that chase cats are dangerous" and the question is "Are cats that chase dogs dangerous?", it might answer "Yes" because the words match the stored fact. But in practice the answer will actually be "Yes" in more than 50% of all cases. Since dogs tend to chase cats, and cats tend not to chase dogs, the concept of cats chasing dogs is less likely to come up in a question answering context than the concept of dogs chasing cats. There are obviously numerous other ways this is going to fail: "Trump believes the world is flat" might be a factoid and the question might be "Is the world flat?".

The result I previously oversimplified should more correctly be stated as follows: Answers coming from the "artificial intelligence" systems submitted for the conference were analyzed to see whether each deviation from the baseline fixed an error in the baseline or introduced an error that the baseline wouldn't have made. The submissions were then analyzed to see if there was a pattern whereby more errors were fixed than introduced. The statistical pattern was basically that random deviation was introduced into the baseline, and systems that scored very badly as a result of such random deviations weren't submitted to the conference. That generated scores that were consistently more likely to be better than baseline than worse than baseline, but not in a meaningful way.

To reiterate the source: it can be found on page 43 here: http://richard.bergmair.eu/pub/thesis.pdf
But that applies to the old RTE conference, so it would be interesting to see if the same holds true here.
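
For anyone who wants to see how little is going on in a bag-of-words baseline of the kind described here, a minimal sketch follows. The function names, the tokenizer, and the 0.8 threshold are my own assumptions for illustration and are not taken from the thesis or from any RTE system.

```python
import re


def bag_of_words(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(question, factoid):
    """Fraction of the question's words that also occur in the factoid."""
    q, f = bag_of_words(question), bag_of_words(factoid)
    return len(q & f) / len(q) if q else 0.0


def baseline_answer(question, factoids, threshold=0.8):
    """Answer "Yes" if any stored factoid covers enough of the question.

    The threshold is an arbitrary, hypothetical choice.
    """
    best = max((overlap(question, f) for f in factoids), default=0.0)
    return "Yes" if best >= threshold else "No"


factoids = [
    "Dogs that chase cats are dangerous",
    "Trump believes the world is flat",
]

# Word order is ignored, so the reversed question matches perfectly.
print(baseline_answer("Are cats that chase dogs dangerous?", factoids))
# The attribution ("Trump believes") is ignored as well.
print(baseline_answer("Is the world flat?", factoids))
```

Both questions come back as "Yes", because every word in each question also appears in a stored factoid. That is exactly the failure mode in the two examples.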
> The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. This leaderboard is for the Challenge Set.
Additionally, I don't think they let you try often enough to get a meaningful chance at significantly beating the baseline with just pure randomness.
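
To put rough numbers on how much the number of tries matters, here is a small simulation sketch. It is a toy model, not the analysis from the thesis: it assumes each system is the baseline plus a fixed number of random deviations, that each deviation is equally likely to fix or introduce an error, and that each team reports only its best attempt. The baseline accuracy, test set size, and all other parameters are made-up values.

```python
import random


def net_gain(rng, n_deviations):
    """Fixes minus breaks when every deviation is a fair coin flip."""
    fixes = sum(rng.random() < 0.5 for _ in range(n_deviations))
    return fixes - (n_deviations - fixes)


def mean_submitted_score(attempts, baseline=0.55, n_questions=800,
                         n_deviations=50, n_teams=2000, seed=0):
    """Average reported accuracy when each team keeps its best attempt.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_teams):
        best = max(net_gain(rng, n_deviations) for _ in range(attempts))
        total += baseline + best / n_questions
    return total / n_teams


for attempts in (1, 3, 20):
    print(attempts, round(mean_submitted_score(attempts), 4))
```

Under these assumptions, a single attempt averages out at the baseline, and the apparent improvement grows with the number of attempts a team can pick from. So a tight limit on tries does shrink the room for beating the baseline by pure randomness.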