by cyanydeez 30 days ago
there are benchmarks that have nothing to do with the training material and instead test what the models can actually do, like reading code: https://needle-bench.cc/

Generally, you give them a document, ask them to retrieve some subsection of it, then rate them on what they retrieved.

You can always find enough random documents, or create your own, to keep running these, and you can make them arbitrarily long. It's definitely a valid non-maxxable context test.
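A minimal sketch of that kind of retrieval test, assuming a plain-text haystack and a single planted "needle" line. The names `build_haystack` and `score_retrieval` are made up for illustration and are not taken from needle-bench or any other benchmark; the call to the model itself is left out.

```python
import random


def build_haystack(filler_lines, needle, position, seed=0):
    """Shuffle filler lines and plant the needle at a relative position (0.0 to 1.0)."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    lines = list(filler_lines)
    rng.shuffle(lines)
    index = int(position * len(lines))
    lines.insert(index, needle)
    return "\n".join(lines)


def score_retrieval(answer, expected):
    """Crude pass/fail: does the model's answer contain the expected text?"""
    return expected.strip().lower() in answer.strip().lower()


# Hypothetical usage: make the document as long as you like by adding filler.
filler = [f"filler sentence number {i}" for i in range(1000)]
needle = "the secret launch code is 7421"
document = build_haystack(filler, needle, position=0.5)
# Send `document` plus a question to the model, then score whatever comes back:
passed = score_retrieval("I found it: the secret launch code is 7421.", "7421")
```

Growing `filler` and moving `position` around gives a rough picture of where in the context retrieval starts to fail.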

1 comment

This seems like a viable eval strategy. Presumably, finding a bug requires some degree of understanding of the code, beyond just information retrieval. However, it probably does not measure things like prompt adherence or the ability to create code that implements a specification?
you can extend the test pretty easily: run through several design turns and ask it for the same section again and again. that effectively measures usable context length.

ask it to modify lines 120-130 and add more context, etc.

we have rudimentary pre-LLM algorithms, like hamming distance and hashing, that can score the result.
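a rough sketch of what that scoring could look like, assuming the model is asked to reproduce a known chunk of text verbatim. `score_reproduction` is a hypothetical helper, not part of any benchmark; the hash gives an exact-match check and the hamming distance counts differing characters when the lengths line up.

```python
import hashlib


def sha256_hex(text):
    """Hash the text so an exact match is a cheap string comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hamming_distance(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def score_reproduction(expected, actual):
    """Return (exact_match, distance); distance is None when lengths differ."""
    exact = sha256_hex(expected) == sha256_hex(actual)
    distance = hamming_distance(expected, actual) if len(expected) == len(actual) else None
    return exact, distance
```

hamming distance only works on equal-length strings, so anything where the model inserted or dropped characters would need an edit-distance measure instead.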

you could even go full https://en.wikipedia.org/wiki/Jabberwocky and feed it nonsense text to see if its sense of context is easily polluted.

the point, though, is that there are benchmarks beyond pelican on a bike that can't be tokenmaxxed and that prove real value in capabilities