Med-PaLM 2 reaches a score of 86.5% on MEDQA (USMLE)

	Med-PaLM 2 reaches a score of 86.5% on MEDQA (USMLE) (arxiv.org)
	3 points by archiv 1132 days ago

ftxbro 1132 days ago

This is amazing: the physicians rate its answers better than answers by other physicians almost across the board. The only exception is that it sometimes gives some irrelevant information. You might think this means the AI is more likely to give harmful information, but they checked this and no, the physicians think the other physicians are the ones who give more harmful advice.

> "We developed this model using a combination of an improved base LLM (PaLM 2 [4]), medical domain-specific finetuning and a novel prompting strategy that enabled improved medical reasoning."

> "The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data. The pre-training corpus is significantly larger than the corpus used to train PaLM"

It sparks joy in my heart that this AI doctor, who answers medical questions better than actual physicians, is almost certainly trained on 4chan greentexts and shitposts.

> "Interestingly, we see a drop in performance between GPT-4-base and the aligned (production) GPT-4 model on these multiple-choice benchmark"

This is not interesting if you know about GPT-4. When they lobotomized it, it became worse at everything except refusing to answer some questions.

archiv 1132 days ago

Well not exactly - Table 6 is a bit concerning. It's meant to show that despite significant data contamination between testing and training datasets, the model still performs 'well'. Except look at those confidence intervals - they're all over the place, meaning the model performance isn't very reliable. You have confidence intervals spanning 30-40 percentage points!

ftxbro 1132 days ago

Well not exactly - Table 6 isn't concerning at all to me.

> look at those confidence intervals - they're all over the place, meaning the model performance isn't very reliable. You have confidence intervals comprising 30-40 percentage points!

OK let's look at the confidence intervals. Let's look at the first row of the table. It has a pretty big confidence interval [76.0, 100.0] for its 'Performance (with Overlap)' entry.

Let's dig into what this really means.

We can see from the first entry in that row that only 12 of the 1273 questions from the 'MedQA (USMLE)' dataset had overlap (as they define it) with text in the training data. This is fewer than one percent of the questions. It's also a small absolute number. Twelve. The 'Performance (with Overlap)' entry is an attempt to judge how well the model does on these twelve questions. On average it does pretty well, but there is a wide confidence interval because the number twelve, the number of questions that overlapped the training data, is so small. Because it's so small, it doesn't provide enough of a sample to estimate exactly how well the model does on questions that it has already seen.
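To make the sample-size point concrete, here is a minimal sketch. It assumes a plain normal-approximation (Wald) interval clipped to [0, 1], which may not be the method the paper used, and the accuracy of 0.9 is a hypothetical value picked only for illustration. The counts 12 and 1273 are the ones quoted above.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% normal (Wald) interval for a proportion, clipped to [0, 1].

    Assumption: this is an illustrative method, not necessarily the paper's.
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Counts from the thread: 12 of the 1273 MedQA (USMLE) questions overlap.
n_overlap = 12
n_no_overlap = 1273 - n_overlap

# Hypothetical accuracy, used for both groups (not a figure from the paper).
p_hat = 0.9

for label, n in [("with overlap", n_overlap), ("without overlap", n_no_overlap)]:
    lo, hi = wald_interval(p_hat, n)
    print(f"{label:>15}: n={n:4d}  "
          f"interval=[{100 * lo:.1f}, {100 * hi:.1f}]  "
          f"width={100 * (hi - lo):.1f} points")
```

With the same assumed accuracy, the interval for the 12-question group comes out many times wider than the one for the 1261-question group. The width is driven by n, not by the model being erratic.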

Your objection is basically like 'there is so little data contamination between testing and training datasets, that there is not enough sample size to tell exactly how well the model does on questions that it has seen in training' which wouldn't make much sense as an objection. In particular, it doesn't mean that "the model performance isn't very reliable."

archiv 1132 days ago

I see what you mean although I was specifically referring to the delta column. I might be misunderstanding it - can you explain the huge flux in the confidence intervals there? That’s what struck me.

ftxbro 1132 days ago

> I was specifically referring to the delta column. I might be misunderstanding it - can you explain the huge flux in the confidence intervals there?

It's essentially the same reason. Some delta confidence intervals are wide, like the one for 'PubMedQA', where only 6 test questions had overlap (as they define it) with the training data. The small sample size of 6 made that interval wide. Some delta confidence intervals are much narrower, like the one for 'MedMCQA', which had 893 questions with overlap out of 4183 total questions. The large sample sizes for both classes (with overlap and without overlap) made that interval much narrower.
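A rough sketch of why the delta interval behaves this way, assuming the delta is a difference of two independent proportions with a normal-approximation interval (an assumption, not necessarily how the paper computed it). The accuracy of 0.75 and the figure of 494 non-overlapping PubMedQA questions are hypothetical placeholders. The counts 6, 893 and 4183 are the ones quoted above.

```python
import math

def delta_half_width(p1, n1, p2, n2, z=1.96):
    """Half-width of an approximate 95% interval for p1 - p2.

    Assumption: two independent groups, normal approximation.
    The standard errors add in quadrature, so the smaller group dominates.
    """
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    return z * se

# Hypothetical accuracy for every group (not a figure from the paper).
p = 0.75

# PubMedQA: 6 overlapping questions (from the thread);
# 494 non-overlapping questions is a hypothetical placeholder.
small = delta_half_width(p, 6, p, 494)

# MedMCQA: 893 overlapping questions out of 4183 total (from the thread).
large = delta_half_width(p, 893, p, 4183 - 893)

print(f"PubMedQA-like delta: +/- {100 * small:.1f} points")
print(f"MedMCQA-like delta:  +/- {100 * large:.1f} points")
```

Because the two variance terms add, one tiny group is enough to make the whole delta interval wide, no matter how large the other group is.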
