


	by spott 701 days ago
	Depends on whether you can fit the whole model into VRAM. If you can’t, you need some form of GPU parallelism, which means the GPUs have to communicate with each other. But maybe that traffic is small enough that it doesn’t slow down inference much. I’m not sure.
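	A rough back-of-envelope sketch of the two questions above: does the model fit on one card, and if not, how much data crosses between GPUs per token? This is not from the comment itself. It counts only weight memory (ignoring KV cache and activations), assumes a simple pipeline-style split where each stage boundary passes one hidden-state vector per token, and the model size, hidden size, VRAM size, and `usable_fraction` are all illustrative assumptions.

```python
import math

GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone (no KV cache, no activations)."""
    return n_params * bytes_per_param


def gpus_needed(n_params: float, bytes_per_param: float,
                vram_bytes: float, usable_fraction: float = 0.9) -> int:
    """Minimum number of GPUs whose combined usable VRAM holds the weights.

    usable_fraction is a hypothetical safety margin, not a measured value.
    """
    usable = vram_bytes * usable_fraction
    return math.ceil(weight_bytes(n_params, bytes_per_param) / usable)


def pipeline_bytes_per_token(hidden_size: int, bytes_per_activation: int,
                             n_gpus: int) -> int:
    """Bytes sent between GPUs per generated token, assuming a pipeline
    split where each of the (n_gpus - 1) boundaries forwards one
    hidden-state vector. Other schemes (e.g. tensor parallelism)
    communicate differently and are not modeled here."""
    return hidden_size * bytes_per_activation * (n_gpus - 1)


if __name__ == "__main__":
    # Assumed example: 70e9 parameters in 16-bit, 24 GiB cards, hidden size 8192.
    n = gpus_needed(70e9, 2, 24 * GIB)
    traffic = pipeline_bytes_per_token(8192, 2, n)
    print(f"GPUs needed (weights only): {n}")
    print(f"Inter-GPU bytes per token (pipeline assumption): {traffic}")
```

	Under these assumptions the per-token traffic is tiny next to the weights each GPU has to read, which is consistent with the guess that the messaging may not dominate. Latency per hop and the parallelism scheme actually used could still change the picture.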