Hacker News
by sunpazed 409 days ago
The key benefit is significantly lower power usage. I benchmarked llama3.2-1B on my machines: M1 Max (47 t/s, ~1.8 watts) and M4 Pro (62 t/s, ~2.8 watts). The GPU is twice as fast (even faster on the Max), but draws much more power (~20 watts) than the ANE.
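
For a rough sense of efficiency, dividing power by throughput gives energy per token. This is only a sketch: the ANE rows use the numbers quoted above, while the GPU row assumes "twice as fast" means 2x the M4 Pro ANE rate (124 t/s) at ~20 watts, which was not measured separately here. The function name is just for illustration.

```python
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: W / (tokens/s) = J/token."""
    return watts / tokens_per_second


# (watts, tokens per second). The GPU row is an assumption, not a measurement:
# it takes "twice as fast" as 2 x 62 t/s and uses the ~20 W figure.
runs = {
    "M1 Max ANE": (1.8, 47.0),
    "M4 Pro ANE": (2.8, 62.0),
    "M4 Pro GPU (assumed 2x ANE speed)": (20.0, 124.0),
}

energy = {name: joules_per_token(w, tps) for name, (w, tps) in runs.items()}

for name, joules in energy.items():
    print(f"{name}: {joules * 1000:.1f} mJ/token")
```

Under those assumptions the ANE lands at roughly 38 to 45 mJ/token against about 161 mJ/token for the GPU, so the GPU would cost around 3.5x the energy per token despite finishing sooner.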

Also, the ANE models are limited to 512 tokens of context, so they are unlikely to be used in production yet.

1 comment

We can run a 2000- or 4000-token context on the ANE.