| HN Mirror

	by 6gvONxR4sf7o 1500 days ago
The alternatives are learning at the character level (way more complex, and scales badly in memory/compute) or at the whole-word level (needs an absurdly massive dictionary of words, and still can’t handle really rare/novel words). Breaking things into a set of subwords that can encode any string solves lots of problems and is the relatively standard way to do things these days.
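To make the subword idea concrete, here is a minimal sketch of a toy BPE-style trainer and encoder. It is an illustration under simplifying assumptions, not how any production tokenizer is implemented: it starts from single characters, greedily merges the most frequent adjacent pair, and ignores byte-level handling, whitespace pre-splitting, and special tokens. The names `train_bpe`, `apply_merge`, and `encode` are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop merging
            break
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def encode(text, merges):
    """Encode any string: unseen characters simply stay as single characters."""
    tokens = list(text)
    for pair in merges:  # apply merges in the order they were learned
        tokens = apply_merge(tokens, pair)
    return tokens


merges = train_bpe("low lower lowest low low", num_merges=10)
print(encode("slowest", merges))  # a word not in the training text still encodes
```

Because the fallback is always the single character, every string has an encoding, and frequent substrings collapse into fewer tokens.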


gwern 1500 days ago

> The alternatives are learning at the character level (way more complex

No, BPEs are more complex: you have a whole additional layer of preprocessing, with all sorts of strange and counterintuitive downstream effects and brand new ways to screw up (fun quiz question: everyone knows that BPEs use '<|endoftext|>' tokens to denote document breaks; what does the string '<|endoftext|>' encode to?). BPEs are reliably one of the ways that OpenAI API users screw up, especially when trying to work with longer completions or context windows.

But a character is a character.
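A small sketch of why the character-level case has no such ambiguity. It assumes "character level" is realized as UTF-8 bytes, which is one common choice and not something the comment specifies; `char_encode` and `char_decode` are hypothetical helper names. It deliberately does not answer the quiz for any real BPE tokenizer.

```python
def char_encode(text):
    """Map a string to a list of integer IDs, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def char_decode(ids):
    """Invert char_encode: rebuild the string from its byte IDs."""
    return bytes(ids).decode("utf-8")


marker = "<|endoftext|>"
ids = char_encode(marker)

# The marker is just ordinary text here: one ID per character, no special case.
print(len(marker), len(ids))
print(char_decode(ids) == marker)
```

With this scheme there is no vocabulary file or merge table to get out of sync with the model, which is the simplicity argument being made.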

> and scales badly in memory/compute)

Actually very competitive: https://arxiv.org/abs/2105.13626#google (Especially if you account for all the time and effort and subtle bugs caused by BPEs.)

6gvONxR4sf7o 1500 days ago

Judging from the abstract, it looks like that paper talks about compute tradeoffs, but do they address memory tradeoffs? Because the context length limitations of (standard) transformers are holding them back from a whole host of applications, and memory being quadratic in sequence length seems like a hell of a cost to going from BPE tokens to characters.

gwern 1499 days ago

You were paying that price to begin with; BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it, so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).
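A back-of-the-envelope sketch of the tradeoff under discussion, assuming standard full self-attention whose score matrix has one entry per pair of positions, and taking the rough "maybe 3x" compression figure at face value. The function names are hypothetical, and real memory use also depends on heads, layers, precision, and implementation.

```python
def attention_scores(seq_len):
    """Entries in one full self-attention score matrix (per head, per layer)."""
    return seq_len * seq_len


def char_vs_bpe_ratio(num_bpe_tokens, chars_per_token=3):
    """How many times more score entries the same text needs as characters."""
    num_chars = num_bpe_tokens * chars_per_token
    return attention_scores(num_chars) / attention_scores(num_bpe_tokens)


for n in (256, 1024, 4096):
    print(n, attention_scores(n), char_vs_bpe_ratio(n))
```

Under these assumptions the ratio is a constant 9x (3 squared) at every length: moving to characters multiplies the cost by a fixed factor, while the growth is quadratic in either representation.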