by andrewmcwatters
317 days ago
I think anyone frequenting HN and actually using these tools absolutely knows these benchmarks are 100% bullshit, and the only real way to test these things is to just use them yourself. Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.