by PaulHoule
1298 days ago
I've worked on commercial systems where N <= 10,000 in the evaluation set, and the confidence interval at that size is probably not as tight as 0.1%. For instance, there is a lot of work on this data set (which we used to tune up a search engine), https://ir-datasets.com/gov2.html, and sometimes it is as bad as N=50 queries with judgements. I don't see papers that are part of TREC or based on TREC data dealing with sampling errors in any systematic way.
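For a rough sense of scale, here is a back-of-the-envelope sketch of how wide a 95% interval is at those sample sizes. It assumes the metric is a simple proportion over independent items (something like precision or accuracy) and uses the normal approximation; real IR metrics such as MAP or nDCG are per-query averages with their own variance, so treat the numbers as indicative only. The function names `ci_half_width` and `queries_needed` are made up for illustration.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed proportion (0.5 is the worst case, giving the widest interval)
    n: number of independent evaluation items (assumed independent)
    z: critical value; 1.96 corresponds to roughly 95% coverage
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def queries_needed(half_width: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose interval half-width is at most `half_width`."""
    return math.ceil((z ** 2) * p * (1.0 - p) / half_width ** 2)


if __name__ == "__main__":
    for n in (50, 10_000):
        print(f"N={n:>6}: +/- {ci_half_width(0.5, n):.4f}")
    print("N needed for +/- 0.001:", queries_needed(0.001))
```

Under these assumptions the worst-case interval is about plus or minus 14 percentage points at N=50 and about plus or minus 1 point at N=10,000; getting down to 0.1% would take on the order of a million judged items.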
|
- systematically addressing sampling error;
- more than 50 queries;
- more/all QRELs;
- full evaluation instead of system pooling;
- studying IR beyond the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);
- devising metrics that take energy efficiency into account;
- ...
At the same time, we have to be very grateful to NIST/TREC for running an open international benchmark every year, which has moved the field forward a lot in the last 25 years.