This tradeoff is ridiculous, even if it is "better" by .01% F score. I would much rather have a dataset created in 1 day from BERT at 98% F-score than 1000 years at 98.01% F-score from a 540B parameter model, or even a 33B parameter model. The performance in million parameter models for NER is still excellent, and works at speed that are usable. Running things through OpenAI is also useless, as it would cost a few million $.
It's more like 100% accuracy vs 95% accuracy, and the super large models are now able to extract non-trivial derived info from a regular human speech as well. While cost-wise it's not efficient right now, this will change over time (you skate to where puck will be, not where it is now), making the current fine-tuning way obsolete. Academically I am not thrilled as I built my research on fine-tuning, but as a producer of a product this solves so many issues at the same time, making me pretty happy.
That's why the LLaMA 2 release is so significant. You can run the full 70B in 8-bit on two prosumer A6000 Ampere (cost around $10k together) which is within the reach of most companies and some devs. This could further accelerate all research to make it both efficient and available even to regular folks.
But it's still not comparable to GPT 4, nor will it likely be for some time at least. And by the time we have GPT 4 class open source models, I'd imagine there'd be significant advancements in closed source models, such as inference time symbolic reasoning using MCTS, something google is working on for gemini...or just bigger/better architectures, data etc