Hacker News
by nullc 615 days ago
They're memory-bandwidth limited: you can basically estimate the performance from the time it takes to read the entire model from RAM for each token.
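
To make that back-of-the-envelope estimate concrete, here is a rough sketch. The function name and the example numbers (a 7B-parameter model at 4 bits per weight, 50 GB/s of memory bandwidth) are made up for illustration, not measurements, and it assumes every weight is read exactly once per token with nothing else in the way.

```python
def estimate_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on decode speed, assuming each token needs one full read of the weights."""
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    # Time to stream the whole model through memory once.
    seconds_per_token = model_bytes / bandwidth_bytes_per_s
    return 1.0 / seconds_per_token


# Hypothetical inputs, chosen only to illustrate the arithmetic.
model_bytes = 7e9 * 0.5   # 7B parameters at 4 bits (0.5 bytes) each = 3.5 GB
bandwidth = 50e9          # 50 GB/s of memory bandwidth (assumed)

rate = estimate_tokens_per_second(model_bytes, bandwidth)
print(f"{rate:.1f} tokens/s")  # prints "14.3 tokens/s"
```

Treat the result as an upper bound: compute, cache effects, and anything else that keeps the memory bus from being saturated will only push the real number lower.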