by zamalek
559 days ago
https://arxiv.org/abs/2410.05229 An 18% drop in accuracy (figure 8) is not insignificant. Even 4o lost 10% (figure 6), and 4o isn't a small LLM. A competent model should show near-zero loss. The simplest benchmark merely changes things like "John had 4 apples" to "Mary had 4 oranges." Losing performance when inconsequential tokens change is the very definition of overfitting.
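To make the name/object swap concrete, here's a rough sketch of that kind of perturbation check. Everything in it is made up for illustration (the template, `make_variants`, and the two toy solvers); it is not the paper's code or its benchmark, and a real run would call an actual model where the stand-in solvers are.

```python
import random
import re

# Hypothetical template: only the name and the fruit vary, the arithmetic does not.
TEMPLATE = (
    "{name} had {n} {fruit}. {name} ate {k} of them. "
    "How many {fruit} does {name} have left?"
)

ORIGINAL = TEMPLATE.format(name="John", n=4, fruit="apples", k=1)


def make_variants(n, k, names, fruits, count, seed=0):
    """Build (question, answer) pairs that differ only in inconsequential tokens."""
    rng = random.Random(seed)
    variants = []
    for _ in range(count):
        question = TEMPLATE.format(
            name=rng.choice(names), n=n, fruit=rng.choice(fruits), k=k
        )
        variants.append((question, n - k))
    return variants


def memorizing_solver(question):
    """Toy stand-in for an overfit model: only knows the exact original wording."""
    return {ORIGINAL: 3}.get(question)


def reasoning_solver(question):
    """Toy stand-in for a robust model: reads the two numbers and subtracts."""
    first, second = (int(x) for x in re.findall(r"\d+", question)[:2])
    return first - second


def accuracy(solver, items):
    """Fraction of items the solver answers correctly."""
    return sum(1 for q, a in items if solver(q) == a) / len(items)


if __name__ == "__main__":
    variants = make_variants(4, 1, ["Mary", "Priya", "Omar"], ["oranges", "pears"], 20)
    for solver in (memorizing_solver, reasoning_solver):
        base = accuracy(solver, [(ORIGINAL, 3)])
        perturbed = accuracy(solver, variants)
        print(solver.__name__, "original:", base, "perturbed:", perturbed)
```

The point of the comparison is the gap between the two numbers per solver: a solver that actually does the arithmetic should score the same on both sets, while one that keyed on the surface wording will not.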