| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by WildGreenLeave 522 days ago
	Correct me if I'm wrong, but, if you run multiple inferences at the same time on the same GPU you will need load multiple models in the vram and the models will fight for resources right? So running 10 parallel inferences will slow everything down 5 times right? Or am I missing something?

3 comments

Palmik 522 days ago

Inference for single example is memory bound. By doing batch inference, you can interleave computation with memory loads, without losing much speed (up until you cross the compute bound threshold).

link

bavell 522 days ago

You will most likely be using the same model so just 1 to load into vram.

link

aeternum 522 days ago

No, the key is to use the full context window so you structure the prompt as something like: For each line below, repeat the line, add a comma then output whether it most closely represents a product or service:

20 bottles of ferric chloride

salesforce

...

link

e12e 522 days ago

Appreciate the concrete advice in this response. Thank you.

link