| HN Mirror


by satisfice 358 days ago

This reads like a collection of ad hoc advice overfitted to experience that is probably obsolete or will be tomorrow. And we don’t even know whether it actually fits the author’s experience.

I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.

It seems to me that a bunch of people are hoping AI can test AI, and to some degree it can. But in the end AI cannot be held accountable for such testing, we can never know all the holes in its judgment, and we cannot expect that fixing one hole will not tear open others.

2 comments

simonw 358 days ago

Hamel wrote a whole lot more about the "LLM as a judge" pattern (where you use LLMs to evaluate the output of other LLMs) here: https://hamel.dev/blog/posts/llm-judge/
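For readers unfamiliar with the pattern, here is a minimal sketch of its shape, not taken from the linked post. The names `stub_judge` and `agreement_rate` are hypothetical, and the judge is a toy rule standing in for a real model call; the point is only that a judge is a function from (question, answer) to a verdict, and that its verdicts can be compared against human labels.

```python
from typing import Callable, List, Tuple

# A judge maps (question, answer) to a pass/fail verdict.
Judge = Callable[[str, str], bool]


def stub_judge(question: str, answer: str) -> bool:
    """Hypothetical stand-in for an LLM call.

    Passes any non-empty answer that does not admit ignorance.
    A real judge would send a grading prompt to a model instead.
    """
    text = answer.strip().lower()
    return bool(text) and "i don't know" not in text


def agreement_rate(judge: Judge, labeled: List[Tuple[str, str, bool]]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    matches = sum(judge(q, a) == human for q, a, human in labeled)
    return matches / len(labeled)


# Made-up examples: (question, model answer, human pass/fail label).
examples = [
    ("What is 2 + 2?", "4", True),
    ("Capital of France?", "I don't know", False),
    ("Name a prime number.", "", False),
    ("Is the sky green?", "Yes, always.", False),  # judge passes, human fails
]

rate = agreement_rate(stub_judge, examples)
print(f"judge/human agreement: {rate:.2f}")
```

Measuring agreement against human labels like this is one simple way to see how far a judge can be trusted before relying on it.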


padolsey 357 days ago

I really recommend people study the measurement frailties and prompting sensitivities of LLM judges before employing them. They're valuable, but should be used with a full understanding of the risks: https://www.cip.org/blog/llm-judges-are-unreliable
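As a rough illustration of what checking prompt sensitivity could look like (a sketch under stated assumptions, not the method from the linked article): run the same judge under several prompt wordings and count how many answers receive conflicting verdicts. `stub_prompted_judge` and `flip_rate` are hypothetical names, and the judge is a toy rule that deliberately reacts to the word "strict" in place of a real model.

```python
from typing import Callable, List

# Hypothetical signature: (prompt_template, answer) -> pass/fail verdict.
PromptedJudge = Callable[[str, str], bool]


def stub_prompted_judge(template: str, answer: str) -> bool:
    """Toy stand-in for an LLM judge whose verdict depends on prompt wording.

    A template containing "strict" demands longer answers than a lenient one.
    """
    min_words = 5 if "strict" in template.lower() else 2
    return len(answer.split()) >= min_words


def flip_rate(judge: PromptedJudge, templates: List[str], answers: List[str]) -> float:
    """Fraction of answers whose verdict differs across prompt templates."""
    if not answers:
        raise ValueError("need at least one answer")
    # An answer "flips" if the set of verdicts across templates has both values.
    flipped = sum(len({judge(t, a) for t in templates}) > 1 for a in answers)
    return flipped / len(answers)


templates = [
    "Is this answer acceptable?",
    "Be strict: is this answer acceptable?",
]
answers = ["Paris", "It is Paris", "The capital of France is Paris"]

rate = flip_rate(stub_prompted_judge, templates, answers)
print(f"verdicts that flip with prompt wording: {rate:.2f}")
```

A high flip rate on rewordings that should not change the verdict is a sign the judge's scores reflect the prompt as much as the answers.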


hamelsmu 358 days ago

Appreciate it, Simon! I have now edited my post to include links to "intro to evals" for those not familiar.


petesergeant 358 days ago

> This reads like a collection of ad hoc advice overfitted to experience that is probably obsolete or will be tomorrow

Even if it is (and very specifically I don't think it is), you've got to start somewhere, and I've not seen advice better than Hamel's kicking about anywhere. His writing helped me get my start on my own evals some months ago, for sure.
