

by languid-photic 45 days ago

Yes! It depends on the extent of changes needed.

If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.

If the changes needed are drastic, it usually signals that something was wrong or ambiguous in the spec (or that the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.

If it's in the middle, I'll usually apply the best implementation and write a follow-on spec.
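A rough sketch of that three-way decision in Python, in case it helps. Everything here is illustrative: `change_fraction` (how much of the best candidate you expect to rewrite), the `small` and `drastic` thresholds, and the names are all assumptions, not something from an actual tool.

```python
from enum import Enum


class NextStep(Enum):
    ITERATE = "apply the best implementation and iterate directly"
    FOLLOW_ON_SPEC = "apply the best implementation and write a follow-on spec"
    RERUN = "improve the spec and rerun the ensemble"


def next_step(change_fraction: float,
              small: float = 0.1,
              drastic: float = 0.5) -> NextStep:
    """Pick a next step from an estimate of how much must change.

    change_fraction is a hypothetical 0..1 estimate of how much of the
    best candidate needs rewriting; the thresholds are arbitrary defaults.
    """
    if not 0.0 <= change_fraction <= 1.0:
        raise ValueError("change_fraction must be between 0 and 1")
    if change_fraction <= small:
        return NextStep.ITERATE
    if change_fraction >= drastic:
        # Drastic changes usually point at a problem in the spec itself.
        return NextStep.RERUN
    return NextStep.FOLLOW_ON_SPEC
```

In practice the estimate is a judgment call, so the thresholds matter less than being explicit about which of the three paths you are on.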


digdugdirk 45 days ago

How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model gets close and only needs a small follow-up to reach the desired result. How would this score in comparison to a larger model that got it right the first time, even if it was much more expensive overall?


languid-photic 45 days ago

We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.

Btw, this also helps manage scale. E.g., say you have 15 diffs to review. Run a few verifiers to get a short list, then review those directly and apply the best.
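For a concrete picture, here is a minimal sketch of turning several verifier rankings into a short list. It uses a Borda count as the aggregation rule, which is an assumption made for illustration, not necessarily how the rankings are actually combined; the function names and candidate labels are hypothetical too.

```python
from collections import defaultdict


def borda_scores(rankings: list[list[str]]) -> dict[str, int]:
    """Sum Borda points per candidate across verifier rankings.

    Each ranking lists candidates best-first. In a ranking of n items,
    the top candidate earns n - 1 points and the last earns 0.
    """
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)


def shortlist(rankings: list[list[str]], k: int) -> list[str]:
    """Return the k highest-scoring candidates (ties broken by name)."""
    scores = borda_scores(rankings)
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k]


# Three blinded verifiers each rank the same three candidate diffs.
rankings = [
    ["diff_a", "diff_b", "diff_c"],
    ["diff_b", "diff_a", "diff_c"],
    ["diff_a", "diff_c", "diff_b"],
]
print(shortlist(rankings, k=2))  # ['diff_a', 'diff_b']
```

With 15 diffs you would pass 15-item rankings and pick a small `k`, then review only those by hand.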
