Inference providers like Fireworks, or the major clouds, can use this to reduce their costs if they don't already have their own reimplementation with similar perf.
vLLM and SGLang may integrate this to be faster at serving DeepSeek-V2/V2.5/V3/R1 on H100/H800s.
I believe that's why they didn't release this back then: it's part of their "moat" (pretty weak tho), and releasing it only benefits competitors.
Open sourcing this after becoming so popular may indicate that they don't want all users on their API/Chat anymore and would rather have the world serve the model instead? Idk.