|
|
|
|
|
by mattcollins
254 days ago
|
|
I'm the person who ran the test. To explain the 60% a bit more... With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases. For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test. |
|
As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing a basic prompt adherance ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here's the leaderboards on formats FWIW:
So, the biggest takeaway really is: Use the best model you can reasonably afford, then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1And if you have no control over model, then use CSV or Markdown Table.