Hacker News new | ask | show | jobs
by qeternity 130 days ago
I would guess you haven't done this in practice. Yes, of course inference is memory bound at low batch sizes. This is why we run larger batch sizes!

Also there does not exist any batch size > 1 where per-request throughput is equal to bs=1. Doing any batching at all will slow all intra-batch requests down.