by Imustaskforhelp
20 days ago
Your playground/write-up is very interesting, and I would be really interested to see something like the Deepseek V4 Flash model (49B) running the way you are suggesting. I haven't read the article yet, though I hope to, but I wish to ask a question: can this approach also be applied to, say, trillion-parameter or other very large models, or is there some wall that makes it valuable only for smaller models? That being said, it's still really incredible, because these small models are getting really good for many use cases and speed is becoming their bottleneck. With greater speeds on consumer hardware, I think it's going to be amazing work!
|
The authors' approach also covers multi-node setups, which won't apply easily to consumer inference because consumer GPUs have very low-performance interconnects; that is why layer parallelism is usually favored there. (But layer parallelism doesn't work very well with the monokernel approach, since it involves running distinct logic on each separate GPU. It also doesn't speed up a single inference, though you can get that throughput back by pipelining small minibatches.)
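To put rough numbers on the pipelining point, here is a minimal sketch of the usual idealized pipeline model. It is my own illustration, not anything from the authors' write-up, and it assumes every stage (one GPU holding a contiguous slice of layers) takes the same time per minibatch and that transfers between GPUs are free. The function names are made up for the example.

```python
def pipeline_time(num_stages: int, num_minibatches: int, stage_time: float) -> float:
    """Wall time for all minibatches to clear an idealized pipeline.

    Assumption: every stage takes `stage_time` per minibatch and
    inter-GPU transfers cost nothing. The first minibatch needs
    `num_stages` steps to come out; each later one adds one step.
    """
    return (num_stages + num_minibatches - 1) * stage_time


def utilization(num_stages: int, num_minibatches: int) -> float:
    """Fraction of GPU time spent on useful work (1.0 means no idle bubbles)."""
    return num_minibatches / (num_stages + num_minibatches - 1)


if __name__ == "__main__":
    stages = 4        # hypothetical: 4 GPUs, each holding a quarter of the layers
    stage_time = 1.0  # arbitrary time unit per stage per minibatch
    for m in (1, 4, 16, 64):
        total = pipeline_time(stages, m, stage_time)
        print(f"minibatches={m:3d}  total_time={total:5.1f}  "
              f"utilization={utilization(stages, m):.2f}")
```

With a single minibatch, only one of the four GPUs is busy at any moment, so latency is no better than running every layer in sequence. As the number of in-flight minibatches grows, utilization approaches 1, which is the "get that throughput back" part. Per-request latency never improves under this model.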