by idle_zealot
21 days ago
I suspect that the compact nature of the syntax would require more tokens spent "thinking" to get decent results. It might be more efficient for simple code, though. Either way, it's worth testing. Surely someone must've set up a "how well LLMs handle Xlang" benchmark suite.
As far as benchmarks go, I'd also like to see ones that try to find what LLMs are good at. Most benchmarks seem designed to give LLMs hard problems and see if they can succeed. In that sense a "good" benchmark is one with a low pass rate.
But if we're going to do agentic coding, we also need to know the opposite: which types of tasks, given in which format, LLMs will succeed at with 95%+ accuracy. Then we can more easily build multi-prompt pipelines with high confidence in each step.
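
To put rough numbers on why per-step accuracy matters, here's a quick sketch of how success rates compound across a chain of prompts. It assumes steps fail independently of each other, which real pipelines may not satisfy, and the function names are just illustrative.

```python
import math

def pipeline_success(step_rates):
    """Chance that every step succeeds, ASSUMING steps fail independently."""
    return math.prod(step_rates)

def max_steps(per_step_rate, target):
    """Largest number of equally reliable steps whose chained
    success rate stays at or above `target` (same independence assumption)."""
    return math.floor(math.log(target) / math.log(per_step_rate))

# Five steps at 95% each already drop the end-to-end rate to roughly 77%.
print(round(pipeline_success([0.95] * 5), 4))  # 0.7738

# At 95% per step, 13 steps is the most you can chain and still
# finish successfully at least half the time.
print(max_steps(0.95, 0.5))  # 13
```

So under that assumption, even 95% per step only buys a fairly short pipeline, which is why knowing where LLMs are reliably above that bar would be useful.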