| HN Mirror

These being mixture of expert (MOE) models should help. The 20b model only has 3.6b params active at any one time, so minus a bit of overhead the speed should be like running a 3.6b model (while still requiring the RAM of a 20b model).

Here's the ollama version (4.6bit quant, I think?) run with --verbose total duration: 21.193519667s load duration: 94.88375ms prompt eval count: 77 token(s) prompt eval duration: 1.482405875s prompt eval rate: 51.94 tokens/s eval count: 308 token(s) eval duration: 19.615023208s eval rate: 15.70 tokens/s

15 tokens/s is pretty decent for a low end MacBook Air (M2, 24gb of ram). Yes, it's not the ~250 tokens/s of 2.5-flash, but for my use case anything above 10 tokens/sec is good enough.