by serjester
327 days ago
It's interesting that the non-thinking variant seems to have actually regressed on quite a few benchmarks compared to flash-2.0. They seem to be prioritizing coding above all else. Even the thinking variant only has marginal gains on non-coding tasks. Our table parsing benchmark [1] has flash-2.0 at 0.84, flash-2.5-lite (non-thinking) at 0.80, and flash-2.5-lite (thinking) at 0.80. Kind of unfortunate to see.

[1] https://github.com/Filimoa/rd-tablebench