by PheonixPharts 864 days ago
"Evaluation" has a pretty standard meaning in the LLM community, the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to measure how well it performs, as a form of benchmarking.
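
To make that concrete, here is a rough sketch of what an evaluation boils down to: a fixed suite of prompts with expected answers, and a score for how many the model gets right. Everything in it is made up for illustration (the `EVAL_SUITE` items, `run_eval`, and `toy_model`, which stands in for a real LLM call), and real benchmarks use far larger suites and usually something more forgiving than exact-match scoring.

```python
from typing import Callable, List, Tuple

# Hypothetical evaluation suite: (prompt, expected answer) pairs.
EVAL_SUITE: List[Tuple[str, str]] = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def run_eval(model: Callable[[str], str], suite: List[Tuple[str, str]]) -> float:
    """Return the fraction of prompts the model answers with an exact match."""
    correct = sum(1 for prompt, expected in suite if model(prompt).strip() == expected)
    return correct / len(suite)


def toy_model(prompt: str) -> str:
    # Stand-in for a real LLM call (an assumption); it only knows two answers.
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


score = run_eval(toy_model, EVAL_SUITE)
print(f"accuracy: {score:.2f}")
```

Swapping `toy_model` for a function that calls an actual model, and the three items for a published benchmark, gives the usual shape of an LLM evaluation run.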

Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.

If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.

2 comments

There’s lots to evaluate. If you’re evaluating model quality, there are many benchmarks all trying to measure different things… accuracy in translation, common sense reasoning, how well it stays on topic, whether it can regurgitate a reference from the prompt text, how biased the output is along a societal dimension, other safety measures, etc. I’m in the field but not an LLM researcher per se, so perhaps this is more meaningful to others, but given the post, it seems useful to answer my question: what _exactly_ is being evaluated?

In particular, this works only off the encoded sentences, so it seems to me that anything involving attention, etc., isn’t being evaluated here.

Unit testing isn't an overloaded term. Evaluation by itself is overloaded, though "LLM evaluation" disambiguates it. I first parsed the title as 'faster inference' rather than 'faster evaluation', despite being aware of what LLM evaluation is, because that's a probable path given 'show', 'faster', and 'LLM' in the context window.

That misreading could also suggest some interesting research directions. Bayesian optimization to choose some parameters which guide which subset of the neurons to include in the inference calculation? Why not.