I’m totally in alignment with your blog post (other than terminology). I meant it more as a plea to all these projects that are trying to go into production without any measures of performance behind them.

It’s shocking to me how often it happens. Aside from simply needing to prove that something works, there are so many other benefits.

Cost and model commoditization are part of it, like you point out. There’s also the potential for degraded performance because off-the-shelf benchmarks don’t generalize to your task the way you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are something like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.

But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.
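
To make "a framework to evaluate changes" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny labeled set, the exact-match metric, and the two `variant_*` functions, which stand in for whatever prompt-plus-model combinations you are comparing (in practice each would call your model).

```python
from typing import Callable, List, Tuple

# Hypothetical labeled examples for your own task: (input text, expected label).
EVAL_SET: List[Tuple[str, str]] = [
    ("The package arrived broken.", "negative"),
    ("Works exactly as described.", "positive"),
    ("Refund took three weeks.", "negative"),
    ("Great support team.", "positive"),
]


def accuracy(predict: Callable[[str], str], examples: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the normalized prediction equals the expected label."""
    correct = sum(
        1 for text, expected in examples
        if predict(text).strip().lower() == expected
    )
    return correct / len(examples)


# Stand-ins for prompt/model variants (hypothetical; real ones would call an LLM).
def variant_a(text: str) -> str:
    # Baseline that always answers the same thing.
    return "negative"


def variant_b(text: str) -> str:
    # Keyword rule standing in for an improved prompt.
    negative_words = ("broken", "refund")
    return "negative" if any(w in text.lower() for w in negative_words) else "positive"


# Score every variant on the same examples, then pick the highest scorer.
scores = {
    name: accuracy(fn, EVAL_SET)
    for name, fn in [("variant_a", variant_a), ("variant_b", variant_b)]
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("best:", best)
```

The point is only that every prompt tweak or model swap gets scored against the same examples, so you can tell whether a change actually helped. A real setup would need a larger held-out set and a metric that fits the task.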

Evaluation has been a critical practice in machine learning for years. IMO it is no less imperative when building with LLMs.