I tested Phi-4 with a Japanese functional test suite and it scored much better than prior Phis (and comparable to much larger models, basically in the top tier atm). [1]
The one red flag w/ Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 for themselves...
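
For anyone who wants to spot-check this on their own prompts, here's a rough sketch of what checks for those two constraint types might look like. This is not IFEval's actual code; the function names and the exact matching rules (whole-word, case-insensitive) are my own assumptions for illustration.

```python
import re


def avoids_forbidden_words(response: str, forbidden: list[str]) -> bool:
    """Hypothetical check: True if no forbidden word appears as a whole word.

    Matching is case-insensitive, which is an assumption, not IFEval's rule.
    """
    for word in forbidden:
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return False
    return True


def is_all_uppercase(response: str) -> bool:
    """Hypothetical check: True if the response contains at least one cased
    letter and none of its cased letters are lowercase."""
    return response.isupper()


def follows_all(response: str, checks) -> bool:
    """True only if every constraint check passes for this response."""
    return all(check(response) for check in checks)
```

You'd run each model response through something like `follows_all(response, [is_all_uppercase, lambda r: avoids_forbidden_words(r, ["sorry"])])` and count how often it passes.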