by doctoboggan
443 days ago
> In our testing, raters preferred the reports generated by Gemini Deep Research powered by 2.5 Pro over other leading deep research providers by more than a 2-to-1 margin.

Are these raters experts in the field the report was written on? Did they rate the reports on factuality, breadth, and insight? These sorts of tests (and RLHF in general) are the reason that LLMs often respond with "Great question, you are exactly right to wonder..." or "Interesting insight, I agree that...". I do not want this obsequious behavior; I want "correct answers"[0]. We need better benchmarks when it comes to human preference.

[0]: I know there is no objectively correct answer for some questions.
|
OpenAI's Deep Research seems oddly restricted in the number of sources it uses, e.g., citing one survey article over and over. I suspect it is just too draining and demoralizing for RLHF raters to check Deep Research's citations (especially without a formal bibliography).