
by hamiltont 149 days ago

Anecdotal tip on LLM-as-judge scoring - Skip the 1-10 scale, use boolean criteria instead, then weight manually e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: Reduces volatility of responses while still maintaining creativeness (temperature) needed for good intuition
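
A minimal Python sketch of the scoring described above. The weights and the three criteria come from the example; the dictionary keys and the `weighted_score` helper are illustrative names, and the `verdicts` dict stands in for whatever Y/N answers the judge model returns (the judge call itself is not shown).

```python
# Weights from the example above; the criterion keys are illustrative names.
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}


def weighted_score(verdicts: dict[str, bool]) -> float:
    """Combine per-criterion Y/N verdicts into a single score between 0 and 1."""
    missing = WEIGHTS.keys() - verdicts.keys()
    if missing:
        raise ValueError(f"missing verdicts for: {sorted(missing)}")
    # Each passed criterion contributes its full weight; a failed one contributes 0.
    return sum(weight for name, weight in WEIGHTS.items() if verdicts[name])


# Hypothetical judge output: policy cited, tone off, next steps offered.
example = {"accuracy": True, "tone": False, "next_steps": True}
print(round(weighted_score(example), 2))
```

Because each criterion is a yes/no question, run-to-run noise in the judge can only flip individual verdicts; it cannot drift a score from a 6 to a 7 the way a 1-10 scale can.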

6 comments

pocketarc 149 days ago

I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
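
One way the gate-and-regenerate loop described above could look, as a rough Python sketch. Everything here is an assumption for illustration, not pocketarc's actual implementation: the `Check` class, the `evaluate` and `respond` helpers, the 0.8 threshold, and the three-attempt limit are all hypothetical, and the `passed` callables stand in for LLM judge calls that return Y/N.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    passed: Callable[[str], bool]  # stand-in for an LLM judge returning Y/N
    hard: bool = False             # a failed hard check blocks the response outright
    weight: float = 0.0            # penalty subtracted when a soft check fails


def evaluate(response: str, checks: list[Check]) -> tuple[bool, float, list[str]]:
    """Return (allowed, score, names of failed checks)."""
    failed = [c for c in checks if not c.passed(response)]
    hard_failed = any(c.hard for c in failed)
    score = 1.0 - sum(c.weight for c in failed if not c.hard)
    return (not hard_failed, score, [c.name for c in failed])


def respond(
    generate: Callable[[list[str]], str],
    checks: list[Check],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate with failure feedback until the checks pass or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        response = generate(feedback)  # the generator sees the previous failures
        allowed, score, failed = evaluate(response, checks)
        if allowed and score >= threshold:
            return response
        feedback = failed
    return None  # caller decides what happens next, e.g. hand off to a human


# Toy usage with keyword checks in place of real judge calls.
demo_checks = [
    Check("cites_return_policy", lambda r: "30-day" in r, hard=True),
    Check("offers_next_steps", lambda r: "next" in r.lower(), weight=0.5),
]
drafts = iter(["Sorry about that.", "Our 30-day policy applies. Next, mail it back."])
print(respond(lambda feedback: next(drafts), demo_checks))
```

The cost mentioned above shows up here directly: every attempt pays for one generation plus one judge call per check.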

tomjakubowski 148 days ago

Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.

jorvi 148 days ago

I hate thumbs up/down. 2 values is too little. I understand that 5 was maybe too much, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate, don't want to save to my library, but I would like the system to know I have an opinion on.

I know that consuming something and not thumbing it up/down sort-of does that, but it's a vague enough signal (that could also mean "not close enough to keyboard / remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.

steveklabnik 148 days ago

Here's the discussion from back in the day when this changed: https://news.ycombinator.com/item?id=837698

In practice, people generally didn't even vote with two options, they voted with one!

IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

PunchyHamster 148 days ago

> IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".

Stuff disliked to oblivion was either just straight-out bad or wrong (in the case of bad tutorials/info), and brigading was a very tiny percentage of it.

rednafi 148 days ago

Oh, didn't they remove the dislike count after people absolutely annihilated one of their yearly rewinds with dislikes?

direwolf20 147 days ago

It was removed after some presidential speeches attracted heavy dislikes.

machomaster 147 days ago

The original sin is argued to be the Youtube Rewind 2018. But it took them until 2021 to roll it out.

PunchyHamster 147 days ago

Well, people annihilated every one of their rewinds with dislikes. But yeah, that might've contributed.

UltraSane 148 days ago

YouTube never got rid of downvotes; they just hid the count. Channel admins can still see it and it still affects the algorithm.

giobox 148 days ago

Youtube always kept downvotes and the 'dislike' button, the change (which still applies today) was that they stopped displaying the downvote count to users - the button never went away though.

Visit a youtube video today, you can still upvote and downvote with the exact same thumbs up or down, the site however only displays to you the count of upvotes. The channel owners/admins can still see the downvote count and the downvotes presumably still inform YouTube's algorithms.

machomaster 147 days ago

There is also an independent "Return Youtube Dislike" browser extension that shows the dislike numbers. It's very convenient.

steveklabnik 147 days ago

That doesn't show the real number, only "a combination of scraped dislike stats and estimates extrapolated from extension user data."

piskov 149 days ago

How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

lorey 149 days ago

Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).

Imustaskforhelp 149 days ago

This actually seems like really good advice. I'm interested in how you might adapt this to things like programming language benchmarks.

Maybe by having independent tests, checking whether it passes each one (yes or no), and then weighting the more complicated tasks more heavily? Or how exactly?

hamiltont 149 days ago

Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."

46493168 149 days ago

Isn’t this just rubrics?

8note 149 days ago

It's a weighted decision matrix.