by g-mork 484 days ago
This is probably not relevant to you commercially at the moment (or ever?), but I would love some intuition on how your models perform on really low-end hardware. Does this technique translate into improved CPU-only performance? I'm also curious about density: does the technique require more, fewer, or roughly the same number of parameters as a traditional LLM for the same output quality?
1 comment

Great question! The model can leverage existing GPU hardware more efficiently: it performs more computation per unit of memory transferred. This means that on older hardware one should be able to get inference speeds similar to those of a classical LLM on recent hardware. This is actually interesting commercially, since it opens new ways of reducing AI inference costs.
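
One rough way to picture the "more computation per unit of memory transferred" point is a roofline-style estimate, where attainable throughput is capped by either peak compute or memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below is only an illustration: the hardware figures and intensity values are made-up placeholders, not measurements of any real GPU or of the model discussed here.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: the lower of peak compute and bandwidth * intensity."""
    return min(peak_flops, mem_bandwidth * intensity)

# Hypothetical hardware numbers, for illustration only.
old_gpu = {"peak_flops": 10e12, "mem_bandwidth": 300e9}    # FLOP/s, bytes/s
new_gpu = {"peak_flops": 100e12, "mem_bandwidth": 1000e9}

# Hypothetical arithmetic intensities in FLOPs per byte transferred.
low_intensity = 2.0    # memory-bound workload, standing in for a classical LLM
high_intensity = 8.0   # workload doing more compute per byte moved

classical_on_new = attainable_flops(**new_gpu, intensity=low_intensity)
dense_on_old = attainable_flops(**old_gpu, intensity=high_intensity)

print(f"low intensity on newer GPU:  {classical_on_new:.2e} FLOP/s")
print(f"high intensity on older GPU: {dense_on_old:.2e} FLOP/s")
```

With these placeholder numbers both cases are limited by memory bandwidth rather than compute, so the higher-intensity workload on the older card ends up in the same range as the lower-intensity one on the newer card. Useful FLOP/s is not the same thing as tokens per second, so treat this as intuition only.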