by dvt 111 days ago
I'm still not sure I fully understand the methodology. For example, if Marcus makes the claim: "OpenAI sucks!" why would OpenAI's blog ever corroborate that? The sources used are all AI company blogs (Anthropic, Google, OpenAI) filled with inoffensive corpo-speak likely written to be as middle-of-the-road as possible. In fact, I'd need an A/B test to make sure the LLM itself can properly rate various claims (positive, negative, and neutral) against such corporate sludge.
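A minimal sketch of what such a sanity check could look like, assuming a small hand-labeled set of claims and a `judge(claim, source)` callable that wraps the LLM. The function names, verdict labels, and example rows here are all hypothetical placeholders, not anything taken from the project. The idea is to score the judge separately per claim polarity, so a judge that just shrugs at everything can't hide behind a decent overall number.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# (claim polarity, claim text, source text, expected verdict).
# All rows are invented placeholders; a real check needs hand-labeled data.
Example = Tuple[str, str, str, str]

def accuracy_by_polarity(
    examples: Iterable[Example],
    judge: Callable[[str, str], str],
) -> Dict[str, float]:
    """Return the fraction of correct verdicts for each claim polarity."""
    hits, totals = Counter(), Counter()
    for polarity, claim, source, expected in examples:
        totals[polarity] += 1
        if judge(claim, source) == expected:
            hits[polarity] += 1
    return {p: hits[p] / totals[p] for p in totals}

def always_unaddressed(claim: str, source: str) -> str:
    """Stand-in for the LLM call: a degenerate judge that never commits."""
    return "unaddressed"

EXAMPLES = [
    ("negative", "Model X fails at basic arithmetic.",
     "Model X sets a new bar for reasoning.", "contradicted"),
    ("positive", "Model X improved on coding tasks.",
     "Model X shows gains on coding benchmarks.", "supported"),
    ("neutral", "Model X was released in the spring.",
     "We are excited to share our latest work.", "unaddressed"),
    ("negative", "Model X still hallucinates citations.",
     "We are excited to share our latest work.", "unaddressed"),
]

if __name__ == "__main__":
    print(accuracy_by_polarity(EXAMPLES, always_unaddressed))
```

If the per-polarity numbers for a real judge look like the degenerate one's, the adjudication step isn't telling you much about the claims themselves.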

Small aside: I'm only bringing this up because last year I worked on a game where you had to solve various moral dilemmas in a 1v1 situation (think the trolley problem, where one player says "flip the switch" and the other says "don't flip the switch"). The idea was to get an LLM to rate the arguments in a fun turn-based online game. I built it out, but I kind of gave up when I realized how absolutely awful the LLM was at actually rating arguments and their nuances. Who won legitimately felt more like rolling a die than a verdict given by a real judge or a philosophy professor grading a paper. I put that project aside, but might do a Show HN at some point since the game is basically done.
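One rough way to put a number on that "rolling a die" feeling is to ask the judge about the same pair of arguments many times, alternating which one is shown first, and see how often it lands on the same winner. This is only a sketch: `judge(first, second)` is a hypothetical wrapper around the LLM that returns "first" or "second", and the two stub judges below are stand-ins, not the game's actual code.

```python
from collections import Counter
from typing import Callable

def agreement_rate(
    judge: Callable[[str, str], str],
    arg_a: str,
    arg_b: str,
    trials: int = 20,
) -> float:
    """Share of trials won by the most frequent winner, with the
    presentation order swapped on every other trial."""
    votes = Counter()
    for i in range(trials):
        if i % 2 == 0:
            winner = "a" if judge(arg_a, arg_b) == "first" else "b"
        else:
            winner = "b" if judge(arg_b, arg_a) == "first" else "a"
        votes[winner] += 1
    return max(votes.values()) / trials

def position_biased(first: str, second: str) -> str:
    """Stub judge that always prefers whichever argument it sees first."""
    return "first"

def prefers_longer(first: str, second: str) -> str:
    """Stub judge with a stable (if silly) preference for longer text."""
    return "first" if len(first) >= len(second) else "second"

if __name__ == "__main__":
    a = "Flip the switch: five lives outweigh one."
    b = "Don't flip it."
    print(agreement_rate(position_biased, a, b))
    print(agreement_rate(prefers_longer, a, b))
```

A rate near 0.5 means the verdict depends on ordering or sampling noise rather than on the arguments; a rate near 1.0 only shows the judge is consistent, not that it is right.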

Adjudication[1], which is the real meat of this project, is done in a very partial way, and I genuinely see basically zero value in it. Why not crawl Reddit (or HN)? I know that also has issues, but it at least has more variety of tone.

[1] https://github.com/davegoldblatt/marcus-claims-dataset/blob/...