Hacker News new | ask | show | jobs
by jononor 26 days ago
This seems like a viable eval strategy. Presumably finding a bug requires some degree of understanding of the code, beyond just information retrieval. However it probably does not measure things like prompt adherence or ability to create code that implements a specification?
1 comments

you can extend the test pretty easily. run through design turns and ask it for it again and again. effectively measure context length.

ask it to modify lines 120-130 and add more context, etc.

we have rudimentry preLLM algoritms that can measure hamming distance and hashing.

you could even go all https://en.wikipedia.org/wiki/Jabberwocky to see if its sense of context is easily polluted.

the point though is there are benchmarks beyong pelican on a bike that cant be tokenmaxx and prove real value in capabilities