by jmathai
151 days ago
You may also be getting a worse result at a higher cost. For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-judge evaluation scored gpt5-mini as the clear winner. I don't think I would have considered it for these specific use cases, since I assumed higher reasoning was necessary. We're still waiting on human evaluation to confirm the LLM judge was correct.