(Chat)GPT-4's practical coding abilities are now something like 100x better because it can write code, run it, and reason about how it performed mid-response. They must be using fine-tunes for this, so the overall model could well be better too
by "model" I mean end-to-end, input to output, including the entire framework, its guardrails, and everything else
if they are able to detect hallucinations, filter them out, and automatically re-run, that's a huge improvement in results, even though the core model didn't get any new training
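rough sketch of what I mean by detect, filter, and re-run, in Python. Everything here is made up for illustration: `answer_with_retries`, `make_flaky_generator`, and `arithmetic_checks_out` are hypothetical names, the "model" is a canned toy, and nothing is known about how OpenAI actually does this

```python
from typing import Callable, Optional


def answer_with_retries(
    prompt: str,
    generate: Callable[[str], str],
    looks_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `generate` until `looks_valid` accepts an output, or give up."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if looks_valid(candidate):
            return candidate  # first output that passes the check
    return None  # every attempt was rejected


def make_flaky_generator() -> Callable[[str], str]:
    # Hypothetical stand-in for the core model: wrong first, right second.
    outputs = iter(["2 + 2 = 5", "2 + 2 = 4"])
    return lambda prompt: next(outputs)


def arithmetic_checks_out(text: str) -> bool:
    # Hypothetical stand-in for a hallucination check: verify "a + b = c".
    left, right = text.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)


result = answer_with_retries(
    "What is 2 + 2?", make_flaky_generator(), arithmetic_checks_out
)
print(result)  # 2 + 2 = 4
```

the generator is exactly the same on both calls, but from the outside the wrapped system only ever shows the answer that passed the check, which is why the end-to-end "model" looks better without any new training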