|
|
|
|
|
by withinboredom
113 days ago
|
|
But the ideas are not 'new'. A benchmark that I use to tell me if an AI is overfitted is to present the AI with a recent paper (especially one like a paxos variant) and have it build that. If it writes general paxos instead of what the paper specified, its overfitted. Claude 4.5: not overfitted too much -- does the right thing 6/10 times. Claude 4.6: overfitted -- does the right thing 2/10 times. OpenAI 5.3: overfitted -- does the right thing 3/10 times. These aren't perfect benchmarks, but it lets me know how much babysitting I need to do. My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying. |
|
At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.