Plenty of people would rather skip the hassle of running their own models, and that's totally understandable. Likewise, plenty of people would rather pay for managed hosting than set everything up themselves with free tools. That leaves room for companies to sell real benefits: convenience, a smoother user experience, and faster responses. All of these save time and are worth paying for.
> Stability AI, Apple, Meta, etc are clearly at the finish line
I'm very optimistic and expect them to catch up. I've used the open models a lot. To be clear, they are starting to compare to GPT3.5 Turbo right now, but they can't compete with GPT4 at all. And GPT4 finished training almost a year ago, I think?
I expect open source models to stay ~1.5 years behind. That said, they will eventually be "good enough".
Keep in mind too, though, that using and scaling GPUs is not free. You have to run the models somewhere. Most businesses will still prefer a simple API to call instead of managing the infrastructure. On top of this, many businesses (medium and smaller) will likely find models like GPT4 to be sufficient for their workload, and will appreciate the built-in "rails" for their specific use cases.
tl;dr - open models don't even compare to GPT4 yet (I use them all daily), they aren't free to run, and an API option is still preferable for a massive number of companies, if not most.
> Keep in mind too though that using and scaling GPUs is not free. You have to run the models somewhere.
In the medium to long term, these will probably be dirt cheap to just run in the background, though. It might happen within 3-5 years, since parallel compute is still growing and isn't as bounded by the stagnation of Moore's law.
I get decent performance with my 4090, enough that quantized 30B LLMs running on exllama are very usable. But we're severely VRAM limited, especially on lower-end hardware, which rarely sees > 10GB of VRAM.
I don't know how much slower it could be and still be useful, though. The big thing is that we need more VRAM: 30B is context-length limited with only 24GB of VRAM, and I've only barely made it above 3.2k tokens before running out.
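For a rough sense of where the VRAM goes on a 24GB card, here's a back-of-the-envelope sketch. The architecture numbers are assumptions, not measurements: roughly 32.5B parameters, 60 layers, and a hidden size of 6656 (LLaMA-style "30B"), 4-bit weights, and an fp16 key/value cache. The function names are made up for illustration. It ignores activations, quantization metadata, and framework overhead, so real usage will be higher than what it prints.

```python
# Back-of-the-envelope VRAM estimate for a quantized LLaMA-style model.
# Every architecture number used below is an assumption, not a measurement.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone (no scales or zero points)."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, hidden_size: int, n_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Key/value cache: one key and one value vector per layer per token."""
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value


def estimate_gib(n_params: float, bits_per_weight: float, n_layers: int,
                 hidden_size: int, n_tokens: int) -> float:
    """Weights plus KV cache, in GiB. Ignores activations and overhead."""
    total = (weight_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(n_layers, hidden_size, n_tokens))
    return total / GIB


# Assumed "30B" configuration: ~32.5B params, 4-bit, 60 layers, hidden 6656.
for ctx in (2048, 3200, 8192):
    gib = estimate_gib(32.5e9, 4, 60, 6656, ctx)
    print(f"{ctx:>5} tokens: ~{gib:.1f} GiB before overhead")
```

Under these assumptions the weights take about 15 GiB on their own, and the cache grows by roughly 1.5 MiB per token. That puts a few thousand tokens of context close to the ceiling once overhead is added, and something like 8k tokens well past it.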
I hope you're right, and that it becomes common for systems to have dedicated TPU-type hardware, similar to smartphones, and to come absolutely loaded with VRAM (which I don't think is even that expensive?)
Models will also get smaller, but I'm skeptical we'll get GPT4 performance with any useful context length under 24GB of VRAM any time soon.
Open Source doesn't mean free. It costs a lot of money to run models and keep models up to date, and maybe a "good enough" model runs relatively cheaply, but there's always going to be a "state of the art" that people are willing to pay for.