


	by stefan_ 145 days ago
	The primary (non-malicious, non-stupid) explanation given here is batching. But I think if you looked at large-scale inference you would find that the batch sizes being run on any given rig are fairly static: for any given model part run individually there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism. I think the more likely explanation is, again, the extremely heterogeneous compute platforms they run on.

2 comments

hatmanstack 145 days ago

That's why I'd love to get stats on the load, hardware, and location of where my inference is running. Looking at you, Trainium.


bonoboTP 144 days ago

Why do you think batching has anything to do with the model getting dumber? Do you know what batching means?


stefan_ 144 days ago

Well if you were to read the link you might just find out! Today is your chance to be less dumb than the model!


bonoboTP 144 days ago

I checked the link. It never says that the model's predictions get lower quality due to batching, just that they become nondeterministic. I don't understand why people conflate these things. Also, it's unlikely that they use smaller batch sizes when load is lower. They likely just spin GPU servers up and down based on demand or, more likely, reallocate servers and GPUs between different roles and tasks.
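
To make the "nondeterministic, not lower quality" distinction concrete, here is a minimal sketch in plain Python. It assumes (my assumption, not something stated in this thread) that a different batch size can change the order in which a kernel adds up the same numbers. Floating-point addition is not associative, so the two orders disagree in the last bits, but neither is systematically worse than the other.

```python
# Illustrative only: the same three values, added in two different orders.
# The "batch layout" labels are a hypothetical analogy, not a real kernel.
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c   # one possible reduction order
right_first = a + (b + c)  # another order, e.g. under a different batch layout

print(left_first)   # 0.6000000000000001
print(right_first)  # 0.6

# The results differ, but only by about one unit in the last place.
difference = abs(left_first - right_first)
print(left_first == right_first, difference < 1e-15)  # False True
```

Tiny differences like this can flip a near-tie between two tokens, which changes the output without making it better or worse on average.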
