by diegof79 1 hour ago
To complement the excellent answers that I read in this thread: an eval is a test.

What makes evals particular in the case of AI:

- there are many situations where you can’t test using pattern matching

- you don’t only want to test for correct answers, but for voice and tone too (imagine an LLM-based bank support chatbot that answers in slang)

- evals can be used to compare the performance of different implementations; given the cost of LLMs, that comparison is very important

- running evals is more expensive than running standard tests, because each run makes the LLM calls under test, and evals often use another LLM as a judge. That makes running them on every commit in your CI/CD very expensive

- knowing all the possible inputs for the LLM is impossible, so evals can also be run on samples of runtime traffic to detect anomalies
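
To make the points above concrete, here is a minimal sketch of what an eval harness could look like. Everything in it is hypothetical: `fake_model` stands in for the LLM call under test, `judge` is a cheap rule-based stand-in for an LLM-as-judge, and the slang word list and sampling rate are made up for illustration. A real setup would replace both functions with actual model calls, which is where the cost comes from.

```python
import random
import re
from dataclasses import dataclass

# Hypothetical word list used by the stand-in judge to flag informal tone.
SLANG = {"yo", "dude", "gonna", "lol"}


@dataclass
class EvalCase:
    prompt: str
    must_mention: list[str]  # terms a correct answer is expected to contain


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call under test."""
    if "balance" in prompt.lower():
        return "Your current balance is available under Accounts in the app."
    return "Yo dude, no clue lol."


def judge(case: EvalCase, answer: str) -> dict[str, bool]:
    """Stand-in for an LLM judge: scores correctness and tone separately."""
    text = answer.lower()
    words = set(re.findall(r"[a-z']+", text))
    return {
        "correct": all(term.lower() in text for term in case.must_mention),
        "formal_tone": not (words & SLANG),
    }


def run_evals(model, cases: list[EvalCase]) -> dict[str, float]:
    """Run every case through the model and return the pass rate per criterion."""
    results = [judge(case, model(case.prompt)) for case in cases]
    return {
        key: sum(r[key] for r in results) / len(results)
        for key in ("correct", "formal_tone")
    }


def sample_runtime(logs: list[str], rate: float, seed: int = 0) -> list[str]:
    """Keep roughly `rate` of the production prompts for offline evaluation."""
    rng = random.Random(seed)
    return [entry for entry in logs if rng.random() < rate]


cases = [
    EvalCase("What is my balance?", ["balance"]),
    EvalCase("How do I close my account?", ["close"]),
]
scores = run_evals(fake_model, cases)
print(scores)
```

Because `run_evals` takes the model as a parameter, the same cases can be run against two implementations and the scores compared. The output of `sample_runtime` can be wrapped in `EvalCase` objects and scored the same way, so only a fraction of real traffic pays the judge cost.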