They do need hosting, and now they need very particular hosting with very particular hardware, which is the bottleneck.
Now here is the trick: exporting the magic that makes LLMs work (transformers) into ASIC hardware to get it off the GPU. The problem is the black box of logic gates within the GPU that makes the LLM work.
A few have figured it out. There should be more, way more. Otherwise this will never scale and we'll be stuck in the trap of the cloud, because nobody is asking for less except on their bills.