by doctoboggan 443 days ago
> In our testing, raters preferred the reports generated by Gemini Deep Research powered by 2.5 Pro over other leading deep research providers by more than a 2-to-1 margin.

Are these raters experts in the field the report was written on? Did they rate the reports on factuality, breadth, and insights?

These sorts of tests (and RLHF in general) are the reason that LLMs often respond with "Great question, you are exactly right to wonder..." or "Interesting insight, I agree that...". I do not want this obsequious behavior; I want "correct answers"[0]. We need better benchmarks when it comes to human preference.

[0]: I know there is no objective correct answer for some questions.

1 comments

Even if they were subject matter experts, it's mentally exhausting to judge these things, especially if it's just for an RLHF contracting gig and you're not actually using the report for real work. Even honest and motivated testers would be tempted into relying on surface "vibes" plus the absence of immediately obvious whoppers.

OpenAI's Deep Research seems oddly restricted in the number of sources it uses, e.g., repeating one survey article over and over. I suspect it is just too draining and demoralizing for RLHFers to check Deep Research's citations (especially without a formal bibliography).