Hacker News
by Kuyawa 121 days ago
If this is possible, why don't all online AI engines work like this?

This is a specific model (Llama 3.1 8B) baked into hardware. You can only run this one model, but in exchange you get "low" power consumption and crazy speed.

If you want to run a different model you need new hardware for that new model.

Do we understand how to scale up the hardware to the point where it can run a frontier model? Because this is insane. It will be a game changer for agent systems making 10-100+ calls.
The speed really is crazy: 15k tokens/second.
I have tried it again. This is the future of chat UI, imho.

Generated in 0,074s • 15 754 tok/s
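
For a rough sense of what that readout means for agent systems, here is a back-of-the-envelope sketch. It reads the figures above as 0.074 s and 15,754 tok/s, and it assumes (my assumption, not something stated in the thread) that every call in a chain generates about as many tokens as this one and that network and prompt-processing time are ignored. The function names are made up for illustration.

```python
def tokens_generated(seconds: float, tok_per_s: float) -> float:
    """Tokens produced in `seconds` at a steady rate of `tok_per_s`."""
    return seconds * tok_per_s


def chain_seconds(calls: int, tokens_per_call: float, tok_per_s: float) -> float:
    """Pure generation time for `calls` sequential calls of equal size.

    Assumption: no network latency, no prompt processing, no tool time.
    """
    return calls * tokens_per_call / tok_per_s


RATE = 15_754      # tok/s, from the readout above
ELAPSED = 0.074    # seconds, from the readout above

# Size of the single response shown in the readout (about 1,166 tokens).
per_call = tokens_generated(ELAPSED, RATE)

if __name__ == "__main__":
    print(f"tokens in this response: {per_call:.0f}")
    for calls in (10, 100):
        total = chain_seconds(calls, per_call, RATE)
        print(f"{calls} sequential calls: {total:.2f} s of generation")
```

Under those assumptions, a 10-call chain spends well under a second generating and a 100-call chain only a few seconds, so in a real agent loop the network and tool calls would likely dominate.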