

by mythz 564 days ago

Was disappointed in all the Phi models before this, whose benchmark results were far better than how they worked in practice, but I've been really impressed with how good Phi-4 is at just 14B. We've run it against the top 1000 most popular StackOverflow questions and it came in 3rd, beating out GPT-4 and Sonnet 3.5 in our benchmarks, behind only DeepSeek v3 and WizardLM 8x22B [1]. We're using Mixtral 8x7B to grade the quality of the answers, which could explain how WizardLM (based on Mixtral 8x22B) took 2nd place.

Unfortunately I'm only getting 6 tok/s on an NVIDIA A4000 so it's still not great for real-time queries, but luckily now that it's MIT licensed it's available on OpenRouter [2] for a great price of $0.07/$0.14 per 1M tokens at a fast 78 tok/s.
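
To put those numbers in perspective, here is a back-of-the-envelope sketch. The 500-token answer length is purely an assumption for illustration, and it also assumes the second price ($0.14) is the per-1M-token output rate.

```python
# Rough latency/cost estimate for a single generated answer.
ANSWER_TOKENS = 500  # hypothetical answer length, not a figure from this thread


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Time to produce `tokens` output tokens at a steady decode rate."""
    return tokens / tok_per_s


def output_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` output tokens at a per-1M-token price."""
    return tokens * usd_per_million / 1_000_000


local_s = seconds_to_generate(ANSWER_TOKENS, 6)     # self-hosted A4000 rate
hosted_s = seconds_to_generate(ANSWER_TOKENS, 78)   # OpenRouter rate
hosted_cost = output_cost_usd(ANSWER_TOKENS, 0.14)  # assumes $0.14 is the output price

print(f"{local_s:.1f}s local vs {hosted_s:.1f}s hosted, ~${hosted_cost:.5f} per answer")
```

Under those assumptions a single answer takes well over a minute locally versus a few seconds hosted, which is the gap that matters for real-time queries.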

Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].

[1] https://pvq.app/leaderboard

[2] https://openrouter.ai/microsoft/phi-4

[3] https://pvq.app/questions/ask

4 comments

KTibow 564 days ago

Interesting eval but my first reaction is "using Mixtral as a judge doesn't sound like a good idea". Have you tested how different its results are from GPT-4 as a judge (on a small scale) or how stuff like style and order can affect its judgements?

Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
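
For anyone wanting to try the small-scale comparison, one way to quantify how much two judges differ is chance-corrected agreement (Cohen's kappa) over the same set of answers. This is only a sketch: the labels below are made up for illustration and are not results from pvq.app or any real judge.

```python
from collections import Counter


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two judges' labels on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Fraction of items where both judges gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each judge's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades for the same 8 answers (made-up data, illustration only).
judge_a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad"]
judge_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

print(round(cohen_kappa(judge_a, judge_b), 2))
```

A kappa near 1 would suggest the cheaper judge is a reasonable stand-in; a value near 0 means the two agree no more often than chance.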


mythz 564 days ago

Yeah, we evaluated several models for grading ~1 year ago and concluded Mixtral was the best choice for us, as it gave the best results of the models we could self-host, which let us distribute the load of grading 1.2M+ answers over several GPU servers.

We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost, but unfortunately it gave too many poor answers good grades [1]. If we had to pick a new grading model now, hopefully the much improved Gemini Flash 2.0 would yield better results.

[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...


KTibow 564 days ago

There are a lot of interesting options. Gemini 2 Flash isn't ready yet (the current limits are 10 RPM and 1500 RPD) but it could definitely work. An alternative might be using a fine-tuned model - I've heard good things about OpenAI fine-tuning with even a few examples.
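
To show why those limits are a blocker at this scale, here is a quick calculation using the 1.2M+ answers mentioned earlier in the thread. It assumes one request per graded answer, which is a simplification.

```python
import math

ANSWERS = 1_200_000  # "1.2M+" answers from earlier in the thread
RPM, RPD = 10, 1500  # rate limits quoted above

# The effective daily cap is the smaller of the per-day limit and
# what the per-minute limit would allow over 24 hours.
per_day = min(RPD, RPM * 60 * 24)

# Assumes one request per graded answer.
days = math.ceil(ANSWERS / per_day)

print(per_day, days)  # 1500 800
```

So the daily cap, not the per-minute one, is the binding constraint: a full regrade would take roughly 800 days on a single key.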


lolinder 564 days ago

Honestly, the fact that you used an LLM to grade the answers at all is enough to make me discount your results entirely. That it showed an obvious preference for the model with which it shares weights is just a symptom of the core problem, which is that you had to pick a model to trust before you even ran the benchmarks.

The only judges that matter at this stage are humans. Maybe someday when we have models that humans agree are reliably good you could use them to judge lesser-but-cheaper models.


segmondy 564 days ago

Yup, I did an experiment a long time ago, where I wanted best of 2. I had Wizard, Mistral & Llama. They would each generate a response and I would pass the responses to all 3 models to vote. I would pass them in a new prompt without reference to the previous prompt. 95%+ of the time, they all voted for their own response, even when it was clear there was a better one. LLM as a judge is a joke.
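
A self-preference number like that can be tallied from vote records along these lines. This is a minimal sketch, not the original experiment's code: the `(judge, author_of_chosen_response)` record format and the sample votes are assumptions made up for illustration.

```python
from collections import defaultdict


def self_vote_rate(votes: list[tuple[str, str]]) -> dict[str, float]:
    """Per-judge share of votes cast for the judge's own response.

    Each vote is a (judge, author_of_chosen_response) pair.
    """
    total: dict[str, int] = defaultdict(int)
    own: dict[str, int] = defaultdict(int)
    for judge, chosen in votes:
        total[judge] += 1
        own[judge] += int(judge == chosen)
    return {judge: own[judge] / total[judge] for judge in total}


# Made-up votes for illustration only.
votes = [
    ("wizard", "wizard"), ("wizard", "wizard"),
    ("mistral", "mistral"), ("mistral", "llama"),
    ("llama", "llama"), ("llama", "llama"),
]

print(self_vote_rate(votes))
```

With three candidates, an unbiased judge facing responses of equal quality would pick its own about a third of the time, so rates far above that point to self-preference.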


mythz 564 days ago

The Mixtral grading model only calculates the starting votes for each answer. Users can then vote on their preferred answers, which further affects the leaderboard standings.

It should be noted that Mixtral 8x7B didn't grade its own model very high at 11th; its standout was grading Microsoft's WizardLM2 model pretty high at #2. That's not entirely without merit, though, as at the time of release it was Microsoft's most advanced model and the best open-source LLM available [1]. We also found it generated great, high-quality answers, so I'm surprised it's not used more: it's only OpenRouter's 15th most used model this month [2], although it has had very little marketing behind it, essentially just an announcement blog post.

Whilst nothing is perfect, we're happy with the grading system as it's still able to tell good answers from bad ones, good models from bad ones, and which topics models perform poorly on. Some of the grades are surprising since we have preconceptions about where models should rank before the results are in, which is also why it's important to have multiple independent benchmarks, especially benchmarks that LLMs aren't optimized for, as I've often been disappointed by how some models perform in practice vs how well they perform in benchmarks.

Either way, you can inspect the different answers from the different models yourself by paging through the popular questions [3].

[1] https://wizardlm.github.io/WizardLM2/

[2] https://openrouter.ai/rankings?view=month

[3] https://pvq.app/questions


lhl 564 days ago

I tested Phi-4 with a Japanese functional test suite and it scored much better than prior Phis (and comparable to much larger models, basically in the top tier atm). [1]

The one red flag w/ Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 for themselves...

[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...

[2] https://github.com/google-research/google-research/blob/mast...
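
For a sense of what those constraint types look like, here are two simplified checkers in the same spirit. These are illustrative sketches only, not the actual IFEval implementation; the function names and matching rules are my own assumptions.

```python
import re


def avoids_forbidden_words(response: str, forbidden: list[str]) -> bool:
    """True if no forbidden word appears as a whole word (case-insensitive)."""
    return not any(
        re.search(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE)
        for word in forbidden
    )


def is_all_caps(response: str) -> bool:
    """True if the response contains letters and all of them are uppercase."""
    return response.isupper()


print(avoids_forbidden_words("A lazy Dog sleeps", ["dog"]))  # False
print(is_all_caps("HELLO, WORLD 42"))                        # True
```

Checks like these are cheap to run over your own prompts, which makes it easy to see whether the low instruction-following score shows up in the constraints you actually care about.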


driverdan 564 days ago

IMO SO questions are not a good evaluation. These models were likely trained on the top 1000 most popular StackOverflow questions. You'd expect them to have similar results and perform well when compared to the original answers.


solomatov 564 days ago

> but luckily now that it's MIT licensed it's available on OpenRouter

Did it have a different license before? If so, why did they change it?
