| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by chaxor 1588 days ago
	The tokens in most gpt models are small like this, but they still 'learn tokenization' very similar to what you just mentioned. It's part of the multi headed attention. It learns what level of detail in the tokenization is needed for given tasks. For example, If you're not interested in parsing the problem for actually doing the computation for example, you don't pay attention to the finer tokenization'. If you do need that level of detail, you use those finer groupings. Some of the difficulty a few years ago was trying to extend these models to handle longer contexts (or just variable contexts which can go to very long), but that also seems close to solved now too. So you're not exactly giving much insight with this observation.