Hacker News new | ask | show | jobs
by AlejaGiral31 91 days ago
Wow! Excellent independent work! I can't believe Grok performed so well! How did you ensure all models were tested equally?
1 comments

Thanks! The short answer is, all models went through identical conditions: same techniques, same prompts and same scoring logic.

I routed everything through OpenRouter with a single API key, so request handling, timeout logic, and retry behavior were identical across models.

OpenRouter does direct forwarding without modifying the prompt payload. If it introduces any bias, it does so equally for all five, which preserves relative comparability.