by jononor
26 days ago
This seems like a viable eval strategy. Presumably finding a bug requires some degree of understanding of the code, beyond just information retrieval.
However, it probably does not measure things like prompt adherence or the ability to write code that implements a specification?
|
Ask it to modify lines 120-130 and add more context, etc.
We have rudimentary pre-LLM algorithms, like hashing and Hamming distance, that can measure the result.
You could even go all https://en.wikipedia.org/wiki/Jabberwocky to see if its sense of context is easily polluted.
The point, though, is that there are benchmarks beyond pelican on a bike that can't be tokenmaxxed and that prove real value in capabilities.
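
For what it's worth, a rough sketch of the "modify lines 120-130" check using nothing but hashing and Hamming distance. The function names here are made up for illustration, and it assumes the edit keeps the line count unchanged, which a real harness would probably need to relax:

```python
import hashlib


def line_hashes(text):
    """Return a SHA-256 hex digest for each line of the text."""
    return [
        hashlib.sha256(line.encode("utf-8")).hexdigest()
        for line in text.splitlines()
    ]


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_stayed_in_range(original, modified, start, end):
    """Return True if no line outside start..end (1-indexed, inclusive) changed.

    Assumption: the edit did not add or remove lines. If the line
    count differs, this sketch simply reports a failure.
    """
    before = line_hashes(original)
    after = line_hashes(modified)
    if len(before) != len(after):
        return False
    # Keep only the hashes of lines the model was NOT asked to touch.
    outside_before = before[:start - 1] + before[end:]
    outside_after = after[:start - 1] + after[end:]
    return hamming_distance(outside_before, outside_after) == 0
```

You would feed it the file before and after the model's edit, e.g. `edit_stayed_in_range(original, modified, 120, 130)`. It only tells you whether the model stayed inside the requested range, not whether the change itself is any good.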