by st-at-picnic
615 days ago
Thanks for reading! We'll definitely include our Sonnet results in the next revision. It's worth pointing out that we're comparing accuracy on text responses rather than log-probability-based scoring, which I think is the number you're referring to (based on Section E of this paper https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bb...). But if I'm mistaken and you have a direct pointer, that'd be super helpful!
In general, we've been basing our comparisons on the models in the Open Medical LLM leaderboard here: https://huggingface.co/spaces/openlifescienceai/open_medical...
Also, definitely a good idea on the ablation study. We had some results internally based on a production-tuned version of our model that includes a much higher weighting of records data. It's an imperfect ablation, but it supports the story, so I think the effect is there, but you're right that it would be more complete to develop and include the data directly.
|
I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in Figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology unfairly punishes other models.
In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples-to-apples comparison with your model.
(I'm also very curious to know how 3.5 Sonnet performs.)
Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or on a forced single-token completion like the one Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at temperature T=0 (i.e., isn't the most likely token always going to be the text response?).
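To make the question concrete, here is a toy sketch of one way the two could diverge even for a single forced token. It is not the method from the paper or from Anthropic's evaluation; the function names and the next-token distribution are made up for illustration. The assumption is that log-probability scoring compares only the allowed answer tokens, while a T=0 text response takes the most likely token over the whole vocabulary.

```python
import math

def greedy_text_answer(logprobs):
    """T=0 decoding: pick the most likely token over the whole vocabulary."""
    return max(logprobs, key=logprobs.get)

def logprob_scored_answer(logprobs, options):
    """Log-probability scoring: compare only the allowed answer tokens."""
    return max(options, key=lambda tok: logprobs.get(tok, -math.inf))

# Hypothetical next-token log probabilities for a yes/no/maybe question.
options = ["yes", "no", "maybe"]
off_format = {
    "The": math.log(0.40),   # model wants to start a sentence instead
    "yes": math.log(0.35),
    "no": math.log(0.20),
    "maybe": math.log(0.05),
}
on_format = {
    "yes": math.log(0.60),
    "no": math.log(0.30),
    "The": math.log(0.10),
}

for name, dist in [("off_format", off_format), ("on_format", on_format)]:
    print(name, greedy_text_answer(dist), logprob_scored_answer(dist, options))
```

Under these assumptions the two agree whenever the top token is one of the answer options, and differ only when the model's most likely token is something else (which a text-based scorer would presumably mark wrong). If the text responses are instead parsed from a longer CoT completion, that is a different comparison again.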