Yeah, you're right. Even though CUDA is async, doing any preprocessing in Python can be harder without shared memory (the start-up latency of multiprocessing isn't a problem in this context). I've only ever run into "embarrassingly parallel" data-feeding problems, where the memory overhead of multiprocessing was small, but I can see there are other situations where it wouldn't be. Comment retracted.