Thanks, that works. I only tested the 1.7B. It has that original GPT-3 feel to it and hallucinates like crazy when it doesn't know something. For something that fits on a GTX 1080, though, it's solid.
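For a rough sense of why a 1.7B model fits on a GTX 1080 (8 GB of VRAM), here is a back-of-the-envelope sketch. It counts only the memory needed to hold the weights at a few common precisions. It ignores the KV cache, activations and framework overhead, so real usage will be higher.

```python
# Rough, weights-only VRAM estimate for a 1.7B-parameter model.
# Assumption: KV cache, activations and framework overhead are ignored,
# so actual memory use will be higher than these figures.

GTX1080_VRAM_GIB = 8  # the GTX 1080 ships with 8 GB of VRAM

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gib(n_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    n = 1.7e9
    for dtype in BYTES_PER_PARAM:
        gib = weight_gib(n, dtype)
        fits = "fits" if gib < GTX1080_VRAM_GIB else "does not fit"
        print(f"{dtype}: {gib:.2f} GiB ({fits} in {GTX1080_VRAM_GIB} GiB)")
```

At fp16 the weights come to roughly 3.2 GiB, which leaves headroom on an 8 GB card for the cache and activations.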
We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?
I think we not only could have had them, we should have had them.
As far as I understand, neural networks were heavily hyped in the '60s and '70s, and when the hype went bust they fell out of focus. The hardware wasn't there yet.
Then they were neglected for many years, and apparently the real pioneering science was done only by Google. The theoretical breakthroughs came in the 2010s. After GPT-2, mass attention caught up and we (over)focused on neural networks again. GPT-2 was well below the capabilities of current hardware, so we quickly caught up, and now we're optimising.
Had it not been for the bursting of the previous hype bubble, NNs wouldn't have been essentially forgotten, and we'd have had a steady stream of optimisations and improvements, each making the most of the hardware available at the time.
Something like a voice translation model running locally should have been possible by the end of the 1990s. That way we'd have had a steady increase in LLM capabilities, no hype, and time to adapt and learn how to use them properly without disruption.
Good call. Right now, though, traffic is low (1 req per min). Given the completion speed I should be able to handle ~100x that, but if the ngrok link doesn't work, defo use the Google Colab link.