by diegof79
1 hour ago
To complement the excellent answers in this thread: an eval is a test. What makes it particular to AI is:

- There are many situations where you can't test using pattern matching.
- You don't only want to test for correct answers but for voice and tone too (imagine an LLM-based bank support chatbot that answers in slang).
- Evals can be used to compare the performance of different implementations; given the cost of LLMs, that's very important.
- Running evals is more expensive than running standard tests, because you rely on the LLM calls under test, and they often use an LLM as a judge. That makes running them on every commit in your CI/CD very expensive.
- Knowing all the possible inputs for the LLM is impossible, so evals can also be run on runtime samples to detect anomalies.
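To make the points above concrete, here is a rough sketch of what a tiny eval could look like. Everything in it is made up for illustration: `formal_bot` and `slang_bot` are hard-coded stand-ins for two LLM-backed implementations, and `tone_judge` is a toy keyword check standing in for what would usually be an LLM-as-judge call. It only shows the shape of the idea (check correctness, check tone, compare implementations by pass rate), not how any particular eval framework does it.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for two implementations under test.
# A real eval would call the actual LLM here, which is where the cost comes from.
def formal_bot(prompt: str) -> str:
    return "Your balance is available in the Accounts tab."

def slang_bot(prompt: str) -> str:
    return "Yo, just check the Accounts tab, dude."

# Assumed slang list, purely illustrative.
SLANG = {"yo", "dude", "gonna", "wanna"}

def tone_judge(answer: str) -> bool:
    """Toy judge: passes when no slang word appears.
    In practice this role is often played by a second LLM call."""
    words = {w.strip(".,!?").lower() for w in answer.split()}
    return not (words & SLANG)

@dataclass
class EvalCase:
    prompt: str
    must_mention: str  # fact the answer must contain (case-insensitive)

def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases that are both correct and on-tone."""
    passed = 0
    for case in cases:
        answer = model(case.prompt)
        correct = case.must_mention.lower() in answer.lower()
        if correct and tone_judge(answer):
            passed += 1
    return passed / len(cases)

CASES = [EvalCase("Where can I see my balance?", "accounts tab")]

if __name__ == "__main__":
    # Compare implementations on the same cases.
    for name, model in [("formal_bot", formal_bot), ("slang_bot", slang_bot)]:
        print(name, run_eval(model, CASES))
```

Both bots give the correct information, so a plain substring test would pass them both; only the tone check separates them. The same `run_eval` could in principle be pointed at a sample of logged production prompts instead of `CASES`, which is the "runtime samples" idea from the last bullet.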