|
|
|
|
|
by mythz
517 days ago
|
|
Was disappointed in all the Phi models before this, whose benchmark results scored way better than it worked in practice, but I've been really impressed with how good Phi-4 is at just 14B. We've run it against the top 1000 most popular StackOverflow questions and it came up 3rd beating out GPT-4 and Sonnet 3.5 in our benchmarks, only behind DeepSeek v3 and WizardLM 8x22B [1]. We're using Mixtral 8x7B to grade the quality of the answers which could explain how WizardLM (based on Mixtral 8x22B) took 2nd Place. Unfortunately I'm only getting 6 tok/s on NVidia A4000 so it's still not great for real-time queries, but luckily now that it's MIT licensed it's available on OpenRouter [2] for a great price of $0.07/$0.14M at a fast 78 tok/s. Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3]. [1] https://pvq.app/leaderboard [2] https://openrouter.ai/microsoft/phi-4 [3] https://pvq.app/questions/ask |
|
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper