by michael0x11
546 days ago
Interesting approach to model serving - the 2-4x lower TTFT compared to vLLM is impressive, but I'd be curious to see detailed benchmarks across different batch sizes and model architectures to validate those performance claims. The no-rate-limits policy is bold but could get expensive fast if you're not doing some clever GPU utilization under the hood.

Also, regarding the no-rate-limits policy: we agree this is a real challenge, and it's part of why we're interested in building this. I think the clever GPU utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at that scale.