| HN Mirror


by dist-epoch 825 days ago

Tokenisation turns a continuous signal into a normalized discrete vocabulary, e.g. for a stock: "went up a lot", "went up a little", "stayed flat". This smooths out noise and makes it easier to match signals that are similar but not identical.
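A minimal sketch of that idea, assuming a plain list of prices and two made-up thresholds (`small`, `large`); the token names and cut-offs are illustrative choices, not anything from a real library:

```python
def tokenize_returns(prices, small=0.005, large=0.02):
    """Map consecutive price changes to a small discrete vocabulary.

    `small` and `large` are hypothetical thresholds on the fractional
    change between two neighbouring prices.
    """
    tokens = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev  # fractional return
        if change > large:
            tokens.append("UP_BIG")
        elif change > small:
            tokens.append("UP_SMALL")
        elif change < -large:
            tokens.append("DOWN_BIG")
        elif change < -small:
            tokens.append("DOWN_SMALL")
        else:
            tokens.append("FLAT")
    return tokens


prices = [100.0, 100.1, 103.0, 104.0, 100.0, 99.2]
tokens = tokenize_returns(prices)
print(tokens)
```

Two moves of +2.9% and +3.4% both come out as `UP_BIG`, which is the "similar but not identical" matching: the exact magnitude is thrown away and only the bucket survives.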

> We tokenize text because text isn't numbers.

Text is actually numbers. People have tried feeding raw UTF-8 bytes directly into transformers, but it doesn't work that well. Karpathy explains why:

https://www.youtube.com/watch?v=zduSFxRajkE
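To make "text is numbers" concrete, here is a small sketch using only Python built-ins. The sample string is an arbitrary choice; the point is just that a string already is a sequence of integers in the range 0..255 once it is UTF-8 encoded:

```python
text = "héllo"

# Encode to UTF-8 and view each byte as an integer in 0..255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# The mapping is lossless: the integers decode back to the same string.
round_trip = bytes(byte_ids).decode("utf-8")

# Non-ASCII characters take more than one byte, so the byte sequence
# is longer than the character sequence.
print(len(text), len(byte_ids))
```

Byte-level input gives a tiny vocabulary (256 values) at the cost of longer sequences, which is the trade-off a learned tokenizer is trying to improve on.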

2 comments

prlin 825 days ago

> Text is actually numbers

Text can be represented by numbers, but the two aren't the same datatype: they don't support the same operations (addition, subtraction, multiplication, etc.).


lamename 825 days ago

Interesting. Can you explain how this is superior and/or different from traditional DSP filters or other non-tokenization tricks in the signal processing field?


dist-epoch 825 days ago

Traditional DSP filters still output a continuous signal. It's also a well-explored domain, so it's hard to imagine any low-hanging fruit there.

My intuition is the following: transformers work really well for text, so we could try turning a time series into a "story" (limited vocabulary) and see what happens.


lamename 825 days ago

Like this or something different?

https://github.com/gzerveas/mvts_transformer
