|
|
|
|
|
by bt1a
307 days ago
|
|
i'd wager your benchmark problems require cumbersome arithmetic or are poorly worded / inadequately described. or, you're mislabeling them as basic math and logic (a domain within which LLMs have proven their strengths!) i only call this out because you're selling it and don't hypothesize* on why they fail your simple problems. i suppose an easily aced bench wouldn't be very marketable |
|
Most of the time they make a correct summation table but fail to copy correctly the sum result into a final result. That is not a tokenisation problem (you can change the output format to make sure of it). I have a separated benchmark that test specifically this, when the input is too large, the LLMs fails to accuratly copy the correct token. I suppose the positional embedding, are not perfectly learned and it sometimes cause a mistake.
The prompt is quite short, it use structured output, and I can generate a nice graph of % of good response accross difficulity of the question (which is just the total digit count of the input numbers.
LLMs have 100% success rate on theses sum until they reach a frontier, past that their accuracy collapse at various speed depending of the model.