by JoshMandel
615 days ago
>LLMD-8B achieves state of the art responses on PubMedQA over all models

Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)

>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.

In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training data set of the same total size but omitting medical record content from the data mix.
|
Also, definitely a good idea on the ablation study. We had some internal results from a production-tuned version of our model that weights records data much more heavily. It's an imperfect ablation, but it supports the story, so I think the effect is there. You're right, though, that it would be more complete to develop that data and include it directly.