I was under the impression that it was mostly bound by GPU VRAM, but that once the model is loaded it could produce output quickly? I'm probably oversimplifying things...
The latest gpt-3.5-turbo model generates very quickly and cheaply (in part due to some recently discovered optimization techniques... older versions cost 10x more). While the hardware required to run GPT-4 is currently unknown, it generates considerably more slowly on average, and its much higher cost points to a higher hardware cost.
And this is per request. It's bananas.
[0] https://www.servethehome.com/chatgpt-hardware-a-look-at-8x-n...