
	by execveat 1178 days ago
	As a data point, I'm getting >3 tokens per second for the 30B model (q5_1 quantization) and >1 token per second for the 65B model (q5_1 as well) on an M1 Max. This is good enough for my use case and it beats an old P40, but I have no idea what the performance on a 3090/4090 would be. Keep in mind that 24GB of VRAM is not enough to hold a quantized 65B model, so it would be using GPU + CPU in that case.
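
	For anyone who wants to sanity-check the VRAM claim, here is a rough back-of-the-envelope sketch. It assumes q5_1 costs about 6 bits per weight (5-bit values plus a per-block scale and minimum, as I understand the ggml block layout), uses the nominal parameter counts rather than exact ones, and ignores the KV cache and other runtime overhead. The function names are made up for illustration.

```python
# Rough estimate of quantized weight size vs. available VRAM.
# ASSUMPTION: q5_1 averages ~6 bits per weight (5-bit values plus
# per-block scale/min metadata). Overhead such as the KV cache is ignored.
Q5_1_BITS_PER_WEIGHT = 6.0


def quantized_size_gb(n_params_billion, bits_per_weight=Q5_1_BITS_PER_WEIGHT):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def fits_in_vram(n_params_billion, vram_gb, bits_per_weight=Q5_1_BITS_PER_WEIGHT):
    """True if the weights alone fit in the given amount of VRAM."""
    return quantized_size_gb(n_params_billion, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    for size in (30, 65):  # nominal model sizes in billions of parameters
        gb = quantized_size_gb(size)
        print(f"{size}B @ q5_1: ~{gb:.2f} GB, fits in 24 GB: {fits_in_vram(size, 24)}")
```

	Under those assumptions the 65B weights come to roughly 49 GB, about twice what a single 24GB card holds, while the 30B weights come to roughly 22.5 GB, which only barely fits before counting any overhead.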