by cotran2 355 days ago
The model is compact (1.5B parameters), so most GPUs can serve it locally with <100ms end-to-end latency. On L40s, it's 50ms.
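
For a rough sense of why a 1.5B model fits on most GPUs, here is a back-of-the-envelope sketch of weight memory alone. The precisions listed and the function name `weight_memory_gb` are my own assumptions for illustration, not details from the model's authors, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Rough weight-only memory estimate for a model of a given parameter count.
# Assumption: bytes per parameter for common precisions; the comment does not
# say which precision the model is actually served in.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # 1.5B parameters, as stated in the comment.
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(1.5e9, precision):.1f} GB")
```

Under these assumptions the weights come to a few GB at most, which is within the memory of many consumer cards. Actual usage will be higher once the serving stack and context are loaded.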