Hacker News
by yvbbrjdr 250 days ago
We actually profiled one of the models and found that the last GEMM, which is completely memory-bound, takes a large share of the runtime and significantly lowers token throughput.
1 comment
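A rough roofline estimate shows why a GEMM like this can be memory-bound during token-by-token decoding. This is a sketch under assumptions not stated in the post: one token per step (m = 1), a hypothetical hidden size of 4096, a hypothetical output width of 128000, fp16 operands, and made-up hardware peaks of 100 TFLOP/s and 1 TB/s. The helper `gemm_roofline` is hypothetical, not from any profiler.

```python
# Roofline estimate for one GEMM C[m, n] = A[m, k] @ B[k, n].
# All shapes and hardware numbers here are hypothetical placeholders.

def gemm_roofline(m, k, n, bytes_per_elem, peak_flops, mem_bw):
    """Return (arithmetic intensity in FLOP/byte, compute time s, memory time s)."""
    flops = 2 * m * k * n  # one multiply and one add per multiply-accumulate
    # Assume each input and the output is read or written exactly once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    intensity = flops / bytes_moved
    return intensity, flops / peak_flops, bytes_moved / mem_bw


if __name__ == "__main__":
    # Hypothetical decode step: 1 token, k=4096, n=128000, fp16 (2 bytes).
    ai, t_compute, t_memory = gemm_roofline(
        m=1, k=4096, n=128000, bytes_per_elem=2,
        peak_flops=100e12,  # hypothetical 100 TFLOP/s
        mem_bw=1e12,        # hypothetical 1 TB/s
    )
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"intensity={ai:.2f} FLOP/byte, compute={t_compute * 1e6:.1f} us, "
          f"memory={t_memory * 1e6:.1f} us -> {bound}-bound")
```

With m = 1 the weight matrix dominates the traffic, so intensity stays near 1 FLOP/byte and the estimated memory time is far larger than the compute time. Raising m (more tokens per GEMM) raises the intensity roughly in proportion, because the weights are read once regardless.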

The parent is right; the issue is on your side.