by 0xDA7A 645 days ago
I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.
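To make "multiplication-free" concrete, here is a minimal sketch, assuming weights restricted to -1, 0, and +1. The function name `ternary_matvec` and the example values are made up for illustration and are not taken from any particular model or paper. Each output element is built from additions and subtractions only.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for a matrix W whose entries are all -1, 0, or +1.

    Illustrative sketch: no multiplications are used, only adds and subtracts.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so the input is skipped
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]
print(ternary_matvec(W, x))  # [1, 4]
```

In hardware the branch would presumably become a small mux feeding an adder, which is where the power savings over a full multiplier would come from.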

This could have insane implications for edge capabilities: robots with massively better swarm dynamics, smart glasses with super-low-latency speech-to-text, etc.

I think the biggest technical hurdle would be computing the nonlinear layers efficiently, but that is solvable too: since you are already retraining the model, you could swap in custom activation functions that better approximate a hardware-efficient nonlinear layer.
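As one example of what a hardware-friendly activation could look like, here is a rough sketch of a piecewise-linear "hard" sigmoid next to the exact one. The slope of 1/4 is my own assumption, picked because a power-of-two slope could become a bit shift in fixed-point hardware. It is not a claim about what any specific accelerator does.

```python
import math


def sigmoid(x):
    """Exact logistic sigmoid, which needs an exponential and a division."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear stand-in for sigmoid (illustrative assumption).

    The line 0.25 * x + 0.5 is clamped to [0, 1]. Multiplying by 0.25 is
    a right shift by two bits in fixed-point arithmetic.
    """
    return min(1.0, max(0.0, 0.25 * x + 0.5))


for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(v, round(sigmoid(v), 3), hard_sigmoid(v))
```

The approximation is coarse near the clamp points, which is why you would want to retrain with the replacement activation in place rather than swap it in after the fact.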

1 comment

The nonlinear layers, particularly the softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block.