by khurdula
49 days ago
Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Those models tend to be in the low-to-mid cost range, with no or low reasoning. For the benchmark, this was kept consistent across all models, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off. Good point tho, will add this to the blog too :) Also, the benchmark is open source, so anyone can run a model on it and open a PR; the leaderboard is dynamic and will automatically add the result.