by hamiltont 149 days ago
Anecdotal tip on LLM-as-judge scoring - Skip the 1-10 scale, use boolean criteria instead, then weight manually e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: Reduces volatility of responses while still maintaining creativeness (temperature) needed for good intuition
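
For anyone who wants to try it, here is a minimal sketch of the weighting step, assuming the judge's Y/N answers have already been parsed into booleans. The criterion names and weights come from the formula above; the function name `weighted_score` and the dict layout are hypothetical, not from any particular library.

```python
# Weights from the post; each criterion is one yes/no answer from the judge.
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}


def weighted_score(checks, weights=WEIGHTS):
    """Combine boolean judge verdicts into a single score between 0 and 1."""
    missing = set(weights) - set(checks)
    if missing:
        # Refuse to score silently when the judge skipped a criterion.
        raise ValueError(f"missing verdicts for: {sorted(missing)}")
    # A passed criterion contributes its full weight, a failed one nothing.
    return sum(w for name, w in weights.items() if checks[name])


# Hypothetical verdicts: policy cited, tone off, next steps offered.
verdicts = {"accuracy": True, "tone": False, "next_steps": True}
print(weighted_score(verdicts))
```

Because each criterion is all-or-nothing, the score can only take a handful of values, which is probably where the reduced run-to-run volatility comes from.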

6 comments

I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
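
A rough sketch of what such a gate-and-regenerate loop could look like, not the commenter's actual system. Everything here is an assumption for illustration: the `Check` class, `evaluate`, `respond`, the `min_score` threshold, and the retry limit are hypothetical names, and the substring lambdas stand in for real yes/no LLM judge calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    name: str
    passed: Callable[[str], bool]  # stand-in for a yes/no LLM judge call
    weight: float = 0.0            # penalty subtracted from the score on failure
    hard: bool = False             # a hard failure blocks the response outright


def evaluate(response: str, checks: List[Check]) -> Tuple[bool, float, List[str]]:
    """Return (allowed, score, names of failed checks)."""
    failed = [c for c in checks if not c.passed(response)]
    score = 1.0 - sum(c.weight for c in failed if not c.hard)
    allowed = not any(c.hard for c in failed)
    return allowed, score, [c.name for c in failed]


def respond(generate: Callable[[List[str]], str], checks: List[Check],
            min_score: float = 0.7, max_attempts: int = 3) -> Optional[str]:
    """Regenerate, passing failed check names back as feedback, until one passes."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        response = generate(feedback)
        allowed, score, feedback = evaluate(response, checks)
        if allowed and score >= min_score:
            return response
    return None  # out of attempts; the caller decides what happens next


checks = [
    Check("cites 30-day return policy", lambda r: "30-day" in r, hard=True),
    Check("offers next steps", lambda r: "next step" in r.lower(), weight=0.2),
]


def fake_generate(feedback: List[str]) -> str:
    # Toy generator: improves its draft once it has received any feedback.
    if feedback:
        return "Our 30-day return policy applies. Next step: reply with your order number."
    return "Sorry about that."


print(respond(fake_generate, checks))
```

Each retry is another generation plus another round of judge calls, which matches the "definitely not cheap" remark above.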

Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.
I hate thumbs up/down. 2 values is too little. I understand that 5 was maybe too much, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate, don't want to save to my library, but I would like the system to know I have an opinion on.

I know that consuming something and not thumbing it up/down sort-of does that, but it's a vague enough signal (that could also mean "not close enough to keyboard / remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.

Here's the discussion from back in the day when this changed: https://news.ycombinator.com/item?id=837698

In practice, people generally didn't even vote with two options, they voted with one!

IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

> IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".

Stuff disliked to oblivion was either just straight-out bad or wrong (in the case of bad tutorials/info); brigading was a very tiny percentage of it.

Oh, didn't they remove the dislike count after people absolutely annihilated one of their yearly rewinds with dislikes?
It was removed after some presidential speeches attracted heavy dislikes.
The original sin is argued to be the Youtube Rewind 2018. But it took them until 2021 to roll it out.
Well, people annihilated every one of their rewinds with dislikes. But yeah, that might've contributed.
YouTube never got rid of downvotes, they just hid the count. Channel admins can still see it and it still affects the algorithm.
Youtube always kept downvotes and the 'dislike' button, the change (which still applies today) was that they stopped displaying the downvote count to users - the button never went away though.

Visit a youtube video today, you can still upvote and downvote with the exact same thumbs up or down, the site however only displays to you the count of upvotes. The channel owners/admins can still see the downvote count and the downvotes presumably still inform YouTube's algorithms.

There is also an independent "Return Youtube Dislike" browser extension that shows the dislike numbers. It's very convenient.
That doesn't show the real number, only "a combination of scraped dislike stats and estimates extrapolated from extension user data."
How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).
This actually seems like really good advice. I'm interested in how you might adapt this to things like programming language benchmarks.

Maybe by having independent tests, checking whether each one passes (yes or no), and then weighting some of them (the more complicated tasks) more heavily than others? Or how exactly?

Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."

Isn’t this just rubrics?
It's a weighted decision matrix.