by archiv
1132 days ago
Well, not exactly - Table 6 is a bit concerning. It's meant to show that despite significant data contamination between testing and training datasets, the model still performs 'well'. Except look at those confidence intervals - they're all over the place, meaning the model performance isn't very reliable. You have confidence intervals comprising 30-40 percentage points!
> look at those confidence intervals - they're all over the place, meaning the model performance isn't very reliable. You have confidence intervals comprising 30-40 percentage points!
OK, let's look at the confidence intervals, starting with the first row of the table. It has a pretty big confidence interval, [76.0, 100.0], for its 'Performance (with Overlap)' entry.
Let's dig into what this really means.
We can see from the first entry in that row that only 12 of the 1273 questions from the 'MedQA (USMLE)' dataset had overlap (as they define it) with text in the training data. That is fewer than one percent of the questions, and it's also a small absolute number. Twelve. The 'Performance (with Overlap)' entry is an attempt to judge how well the model does on these twelve questions. On average it does pretty well, but the confidence interval is wide because twelve questions is too small a sample to estimate precisely how well the model does on questions it has already seen.
Your objection basically amounts to 'there is so little data contamination between the testing and training datasets that there isn't enough of a sample to tell exactly how well the model does on questions it has seen in training', which doesn't make much sense as an objection. In particular, it doesn't mean that "the model performance isn't very reliable."
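To make the sample-size point concrete, here is a rough sketch of how the width of a binomial confidence interval depends on the number of questions. The assumptions are mine, not the paper's: I use a Wilson score interval (I don't know which method Table 6 actually uses), and a hypothetical 90% observed accuracy at each sample size, since the table's raw counts of correct answers aren't quoted here. Only the sizes 12 and 1273 come from the table; 120 is just an in-between value.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% coverage. This is one common
    choice of interval, assumed here for illustration only.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def interval_width_points(successes, n):
    """Width of the interval in percentage points."""
    low, high = wilson_interval(successes, n)
    return 100 * (high - low)


if __name__ == "__main__":
    # 12 and 1273 are the question counts mentioned above; 120 is arbitrary.
    for n in (12, 120, 1273):
        correct = round(0.9 * n)  # hypothetical ~90% accuracy, not from the paper
        low, high = wilson_interval(correct, n)
        print(f"n={n:5d}  correct={correct:5d}  "
              f"CI=[{100 * low:.1f}, {100 * high:.1f}]  "
              f"width={interval_width_points(correct, n):.1f} points")
```

Whatever the exact method, the pattern is the same: with a dozen questions the interval spans tens of percentage points, and with over a thousand it narrows to a few. The wide interval reflects how few overlapping questions there are, not how erratic the model is.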