Hacker News new | ask | show | jobs
by 6gvONxR4sf7o 1500 days ago
Judging from the abstract, it looks like that paper talks about compute tradeoffs, but do they address memory tradeoffs? Because the context length limitations for (standard) transformers is holding them back from a whole host of applications, and memory being quadratic in sequence length seems like a hell of a cost to going from BPE tokens to characters.
1 comments

You were paying that price to begin with, the BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).