by PaulHoule 1298 days ago
I've worked on commercial systems where the evaluation set had N<=10,000, and the confidence interval at that size is probably not as tight as 0.1%. For instance, there is a lot of work on this data set (which we used to tune up a search engine):

https://ir-datasets.com/gov2.html

and sometimes it is as bad as N=50 queries with judgements. I don't see papers that are part of TREC or based on TREC data dealing with sampling errors in any systematic way.
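
For a rough sense of scale, here is a minimal sketch, assuming the metric is a simple proportion (say, the fraction of queries judged successful) and using the normal approximation to the binomial. Real IR metrics such as MAP or nDCG are not plain proportions, so this is only a back-of-the-envelope check. The function names are hypothetical, and p=0.5 is assumed as the worst case.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def ci_half_width(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of a normal-approximation CI for a proportion p over n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


def samples_needed(p: float, half_width: float, z: float = Z_95) -> int:
    """Smallest n whose normal-approximation half-width is at most half_width."""
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)


# Worst case p = 0.5 (assumption) for the two sizes mentioned in the comment.
for n in (10_000, 50):
    print(f"N={n}: +/- {ci_half_width(0.5, n):.4f}")

print("N needed for +/- 0.001:", samples_needed(0.5, 0.001))
```

Under these assumptions the half-width is close to 1% at N=10,000 and close to 14% at N=50, and getting down to 0.1% would take on the order of a million judged items.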


NIST's TREC workshop series uses Cyril Cleverdon's methodology (the "Cranfield paradigm") from the 1960s, and more could surely be done on the evaluation front:

- systematically addressing sampling error;

- more than 50 queries;

- more/all QRELs;

- full evaluation instead of system pooling;

- studying IR beyond the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);

- devising metrics that take energy efficiency into account;

- ...
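
On the first point, one common way to put error bars on a comparison between two systems is a percentile bootstrap over queries. The following is a minimal sketch of that idea, not something TREC prescribes: the per-query score differences are synthetic (generated from an assumed Gaussian purely for illustration), and `bootstrap_ci` is a hypothetical helper name.

```python
import random


def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-query score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic data (assumption): system A minus system B on 50 queries.
data_rng = random.Random(42)
diffs = [data_rng.gauss(0.02, 0.10) for _ in range(50)]

observed = sum(diffs) / len(diffs)
lo, hi = bootstrap_ci(diffs)
print(f"mean difference {observed:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

If the interval includes zero, 50 queries are not enough to separate the two systems at that level, which is the kind of statement that rarely appears next to reported scores.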

At the same time, we have to be very grateful to NIST/TREC for executing an international (open) benchmark annually, which has moved the field forward a lot in the last 25 years.