| Very, very heterogenous and fast moving space. Depending on how they're made up, different teams do vastly different things. No evals at all, integration tests with no tooling, some use mixed observability tools like LangFuse in their CI/CD. Some other tools like arize phoenix, deepeval, braintrust, promptfoo, pydanticai throughout their development. It's definitely an afterthought for most teams although we are starting to see increased interest. My hope is that we can start thinking about evals as a common language for "product" across role families so I'm trying some advocacy [1] trying to keep it very simple including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept but UX getting better with time. What are you doing for yourself/teams? If not much yet, i'd recommend to just start and figure out where the friction/value is for you. - [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception) |
On top of that, there are combinations of models+prompts that give different results. For example a prompt could yield a great response from Claude, but the same prompt could yield a mediocre response from Gemini. Not just that but different models have different capabilities (example of this is that composite function calling doesn't work the same way for all models).
I'm asking because I'm generally curious on how teams are solving this today – and it _seems_ like there is no gold standard for evals yet although it's gaining interest.
How I do evals today is by testing an output across different dimensions (and it can vary based on use-case): relevance, instruction following, clarity, hallucination rate, etc. which sucks a lot of time (and can never be fully accurate because how do you fully measure something like "clarity"?), and I feel like there's a better way out there.