by dist-epoch
825 days ago
Tokenisation turns a continuous signal into a normalized discrete vocabulary: stock "went up a lot", "went up a little", "stayed flat". This smooths out noise and simplifies matching up similar but not identical signals.

> We tokenize text because text isn't numbers.

Text is actually numbers. People tried inputting UTF8 directly into transformers, but it doesn't work that well. Karpathy explains why: https://www.youtube.com/watch?v=zduSFxRajkE
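
To make the "went up a lot / went up a little / stayed flat" idea concrete, here is a rough sketch of what binning a price series into tokens could look like. The function name, the token names, and the 0.5% / 2% thresholds are all made up for illustration; a real system would likely pick bins from the data.

```python
def tokenize_returns(prices, small=0.005, large=0.02):
    """Map each consecutive price change to a coarse token.

    `small` and `large` are hypothetical thresholds on the relative change.
    """
    tokens = []
    for prev, cur in zip(prices, prices[1:]):
        change = (cur - prev) / prev  # relative change, so scale doesn't matter
        if change >= large:
            tokens.append("UP_BIG")
        elif change >= small:
            tokens.append("UP_SMALL")
        elif change <= -large:
            tokens.append("DOWN_BIG")
        elif change <= -small:
            tokens.append("DOWN_SMALL")
        else:
            tokens.append("FLAT")
    return tokens


# Two different stocks at different price levels, with similar moves.
a = tokenize_returns([100.0, 103.0, 103.1, 104.0])
b = tokenize_returns([50.0, 51.6, 51.65, 52.1])
print(a)       # ['UP_BIG', 'FLAT', 'UP_SMALL']
print(a == b)  # True
```

The raw numbers in the two series never match, but after tokenisation the sequences are identical, which is the "matching up similar but not identical signals" part.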
|
Text can be represented by numbers, but they aren't the same datatype. They don't support the same operations (addition, subtraction, multiplication, etc.).
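
A quick illustration of the point, using nothing but Python's built-in UTF-8 encoding (the example strings are arbitrary): the arithmetic is perfectly legal on the byte values, it just doesn't correspond to anything meaningful about the text.

```python
text = "cat"
codes = list(text.encode("utf-8"))
print(codes)  # [99, 97, 116]

# "Adding 1" to the text is well defined on the bytes, but the result
# has no linguistic relationship to the original word.
shifted = bytes(c + 1 for c in codes).decode("utf-8")
print(shifted)  # dbu

# Summing throws away order: an anagram gives the same number.
print(sum(b"cat") == sum(b"act"))  # True
```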