by lovesdogsnsnow 760 days ago
This is interesting! Sort of a super mixture of experts model. What's the latency penalty paid with your router in the middle?

The pattern I often see is companies prototyping on the most expensive models, then testing smaller/faster/cheaper models to determine what is actually required for production. For which contexts and products do you foresee your approach being superior?

Given you're just passing along inference costs from backend providers and aren't taking margin, what's your long-term plan for profitability?


Great question! Generally the neural network used for the router takes maybe ~20ms during inference. When deployed on-prem, in your own cloud environment, this is the only added latency. When using the public endpoints with our own intermediate server, it might add ~150ms to the time-to-first-token, but inter-token latency is not affected.
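To put rough numbers on that, here is a minimal sketch of how the overhead plays out over a full response. The ~20ms and ~150ms figures are the ones quoted above (treating ~150ms as the total added in the public-endpoint case, which is an assumption); the baseline time-to-first-token, per-token latency, and response length are hypothetical values picked only for illustration.

```python
# Figures quoted in the comment above (approximate).
ON_PREM_OVERHEAD_MS = 20    # router inference only
PUBLIC_OVERHEAD_MS = 150    # assumed total added time-to-first-token via the public endpoint


def total_latency_ms(base_ttft_ms, inter_token_ms, n_tokens, router_overhead_ms=0):
    """Wall-clock time to receive n_tokens.

    The router overhead is paid once, before the first token;
    inter-token latency is left unchanged.
    """
    ttft = base_ttft_ms + router_overhead_ms
    return ttft + inter_token_ms * (n_tokens - 1)


# Hypothetical backend: 400ms to first token, 30ms per token, 200-token reply.
direct = total_latency_ms(400, 30, 200)
on_prem = total_latency_ms(400, 30, 200, ON_PREM_OVERHEAD_MS)
public = total_latency_ms(400, 30, 200, PUBLIC_OVERHEAD_MS)

print(f"direct:  {direct} ms")
print(f"on-prem: {on_prem} ms (+{on_prem - direct} ms)")
print(f"public:  {public} ms (+{public - direct} ms)")
```

Because the overhead is a fixed one-off cost, it matters less the longer the response is.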

We generally see the router being useful when the LLM application is being scaled, and cost and speed start to matter a lot. However, in some cases the output quality actually improved, as we're able to squeeze the best out of GPT-4, Claude, etc.

Long-term plan for profitability would come from some future version of the router, where we save the user time and money, and then charge some overhead for the router, but with the user still paying less than they would with a single endpoint. Hopefully that makes sense?
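As a back-of-the-envelope sketch of that idea (not actual pricing; the savings and fee percentages below are made-up assumptions): if routing cuts the backend bill by a fraction s and the router fee is a fraction f of the routed bill, the user comes out ahead whenever (1 - s) * (1 + f) < 1.

```python
def routed_cost(single_endpoint_cost, savings_fraction, router_fee_fraction):
    """Total the user pays when going through the router.

    savings_fraction: share of the single-endpoint bill saved by routing
                      (hypothetical).
    router_fee_fraction: router overhead charged on top of the routed
                         backend bill (hypothetical).
    """
    backend = single_endpoint_cost * (1 - savings_fraction)
    fee = backend * router_fee_fraction
    return backend + fee


def user_still_saves(savings_fraction, router_fee_fraction):
    """True when the routed total is below the single-endpoint bill."""
    return (1 - savings_fraction) * (1 + router_fee_fraction) < 1


# Hypothetical: $100 bill on a single endpoint, routing saves 40%, router fee is 20%.
total = routed_cost(100.0, 0.40, 0.20)
print(f"user pays ${total:.2f} instead of $100.00")
print("user still saves:", user_still_saves(0.40, 0.20))
```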

Happy to answer any other questions!

Do you save the user data, i.e., the searches themselves? What do your TOS guarantee about the use of that data?
We use this data to improve the base router by default. It's fully anonymized, and you can opt out.
Without an opt-out it would be a no-go, so that's great to hear. What's the downside of opting out?
No downside.
If I was doing this I'd negotiate a volume discount, charge the clients the base rate and pocket the difference.
Definitely on the cards, we're keeping our options open here. Right now we're just focused on creating value, though.