by pama
492 days ago
A big part of why R1 is much slower than o3-mini is that most solutions for serving R1 models have not yet had inference optimization applied (so R1's latency is comparable to o1 or o1 pro rather than o1-mini or o3-mini). The MoE is already relatively efficient if perfectly load balanced in an inference setting, and should have latency and throughput equal to or better than an equivalent dense model with 37B parameters. In practice, thanks to MLA, inference should be faster still for long contexts compared to typical dense models. If DeepSeek or someone else distilled the model onto another MoE architecture with even fewer active parameters and properly implemented speculative decoding on top, one could gain additional inference speedups. I imagine we will see these things, but it will take a bit of time until they are all public.
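To put rough numbers on the two levers above (active parameter count and speculative decoding), here is a back-of-envelope sketch. It assumes single-sequence decoding that is purely bound by reading the active weights from memory, and it ignores KV-cache reads, expert load imbalance, and inter-GPU communication. The function names, the 1 TB/s bandwidth, the 1 byte per parameter, and the 0.8 acceptance rate are illustrative assumptions, not measurements of any real deployment. The speculative decoding estimate uses the standard expected-accepted-tokens formula for a fixed per-token acceptance rate.

```python
def decode_ms_per_token(active_params: float,
                        bytes_per_param: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency (ms) if reading the active
    weights once per token is the only cost (assumption)."""
    return active_params * bytes_per_param / bandwidth_bytes_per_s * 1e3


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass when a draft model
    proposes k tokens and each is accepted with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Hypothetical inputs: 37B active parameters, 1 byte each, 1 TB/s memory.
base_ms = decode_ms_per_token(37e9, 1.0, 1e12)

# Hypothetical draft setup: 4 drafted tokens, 80% acceptance rate.
tokens_per_pass = expected_tokens_per_step(0.8, 4)

# Ideal speedup ceiling; the draft model's own cost is ignored here.
print(f"{base_ms:.1f} ms/token baseline, "
      f"{base_ms / tokens_per_pass:.1f} ms/token with speculation")
```

Halving the active parameters halves the first number under these assumptions, and the speculative decoding factor multiplies on top of that, which is why the two ideas compound.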
I remain unconvinced that DeepSeek themselves didn't optimize their own V3 inference well enough and left another 2x to 3x improvement on the table.