You were paying that price to begin with; BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it, so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).
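
To put rough numbers on the "doesn't resolve the quadratic" point, here is a back-of-the-envelope sketch. It assumes plain full self-attention, where cost scales with the number of query-key pairs, and takes the ~3x compression figure at face value; the function names are made up for illustration. A 3x shorter sequence buys a constant 9x saving, but doubling the input still quadruples the cost either way.

```python
# Rough sketch: full self-attention compares every position with every other,
# so the work grows with seq_len ** 2. All names here are illustrative.

COMPRESSION = 3  # assumed ~3x BPE compression (the rough figure from the comment)


def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention."""
    return seq_len * seq_len


def cost_ratio(n_chars: int, compression: int = COMPRESSION) -> float:
    """Character-level attention cost divided by BPE-level cost for the same text."""
    n_tokens = n_chars // compression
    return attention_pairs(n_chars) / attention_pairs(n_tokens)


if __name__ == "__main__":
    for n_chars in (3_000, 6_000, 12_000):
        n_tokens = n_chars // COMPRESSION
        print(
            f"{n_chars} chars: {attention_pairs(n_chars)} pairs | "
            f"{n_tokens} BPE tokens: {attention_pairs(n_tokens)} pairs | "
            f"ratio: {cost_ratio(n_chars)}"
        )
```

The ratio stays fixed as the text grows, which is the point: BPEs shift the curve down by a constant factor, they don't change its shape.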