| HN Mirror



	by Lienetic 742 days ago
	Are there any approaches today that you've found to be at least mostly reliable? Bonus points if it is somewhat clear/easy/predictable to know when it isn't or won't be. We use human evaluation, but that is naturally far from scalable, which has been a particular problem when working on more complicated workflows/chains where changes can have a cascading effect. I've been encouraging a lot of dev experimentation on my team but would like a more consistent eval approach so we can evaluate and discuss changes with more grounded results. If all of these metrics are low confidence, they become counterproductive, since people easily fall into the trap of optimizing the metric.

1 comments

nirga 742 days ago

I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd see whether you can rely on them more.

We've written a couple of blog posts about some of them: https://www.traceloop.com/blog
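
To make "classic NLP metric" concrete, here is a minimal sketch of one such metric, a token-overlap F1 between a model output and a reference answer. It is only an illustration: the names `tokenize` and `token_f1` are made up for this example, the tokenizer is a naive lowercase word split, and nothing here is taken from the blog posts above.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive assumption: lowercase and split on word characters only.
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference string."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the cat", "the cat sat"))
```

The appeal of a metric like this is that it is deterministic: the same pair of strings always gets the same score, so a change in the number reflects a change in the output rather than judge variance. The tradeoff is that it only measures surface overlap, so a correct paraphrase can score low.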


swyx 742 days ago

for your blog, can i offer a big downvote for the massive AI-generated cover image thing? it's a trend for normies, but for developers it's absolutely meaningless. give us info density pls


nirga 742 days ago

roger that! I like them though (am I a normie then?)
