Hacker News
by mock-possum 108 days ago
We feed a handful of preset questions through the new AI and collect the results. Then we ask another AI to score the answers against example ‘good’ answers we’ve written. Finally, a guy sits down and uses the output as a starting point to rank that AI’s performance against all the previous ones.

Seems like it works pretty well. Our prompts and params get tweaked towards better and better results, and we get a sense of what’s worth paying more for.
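
For anyone who wants the shape of that loop in code, here is a rough sketch, not our actual pipeline. Every name in it (`EvalCase`, `overlap_judge`, `run_eval`, `draft_ranking`) is made up for illustration, the models are plain callables, and the word-overlap judge is only a stand-in for the second AI so the example runs without any API calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalCase:
    question: str     # one of the preset questions
    good_answer: str  # hand-written reference answer


# Hypothetical signatures: a model maps a question to an answer;
# a judge maps (question, answer, reference) to a score in [0, 1].
Model = Callable[[str], str]
Judge = Callable[[str, str, str], float]


def overlap_judge(question: str, answer: str, reference: str) -> float:
    """Stand-in for the AI judge: fraction of reference words present in the answer."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / len(ref_words)


def run_eval(model: Model, cases: List[EvalCase], judge: Judge) -> float:
    """Mean judge score for one model over the preset questions."""
    return mean(judge(c.question, model(c.question), c.good_answer) for c in cases)


def draft_ranking(
    models: Dict[str, Model], cases: List[EvalCase], judge: Judge
) -> List[Tuple[str, float]]:
    """Models sorted best-first by mean score: a starting point for the human reviewer."""
    scores = {name: run_eval(m, cases, judge) for name, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage with two fake "models".
cases = [EvalCase("What is 2 + 2?", "the answer is 4")]
models = {
    "old": lambda q: "probably 5",
    "new": lambda q: "the answer is 4",
}
ranking = draft_ranking(models, cases, overlap_judge)
print(ranking)
```

In practice the judge would be a call to the second model with the reference answer in the prompt, and the sorted list is only the draft that the human reviewer then adjusts.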

1 comments

The guy who reviews all of this, is his role in the company fully dedicated to reviewing these eval pipelines?
Yeah, and he’s kind of a black box of a contractor. People kept saying his name, which is unusual, and at first I figured it was some software or another company we were using. Eventually I realized it was a real guy: we just feed him LLM results and he ranks them for us. He’s not a full-time employee and I’ve never actually seen him or had any contact with him, so now I think it’s entertaining to imagine that he’s a figment of the CEO’s imagination, an alter ego that takes over after hours and obsessively reviews LLM outputs.