by meow_cat 825 days ago
Maybe I'm missing something obvious, but what is the idea behind quantizing and tokenizing time series? We tokenize text because text isn't numbers. In the case of time series, we're... turning numbers into less precise numbers? The benefit of scaling and centering is trivial, and I guess all time series ML does it, but I don't see why we need a token after that.
5 comments

I'm building upon insights from this paper (https://arxiv.org/pdf/2403.03950.pdf) and believe that classification can sometimes outperform regression, even when dealing with continuous output values. This is particularly true in scenarios where the output is noisy and may take several distinct values (multimodal). By treating the problem as classification over discrete bins, we can obtain an approximate distribution over these bins, rather than settling for a single, averaged value as regression would yield. This approach not only facilitates sampling but may also lead to more favorable loss landscapes. The linked paper provides more details on this idea.
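To make the "averaged value" point concrete, here is a minimal sketch, not taken from the paper. The bimodal target (modes at -1 and +1), the five bins, and every name in it are assumptions chosen for illustration. A squared-error regressor converges to the mean, which lands between the modes; a histogram over bins keeps both modes.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical noisy target: half the samples sit near -1, half near +1.
samples = [random.gauss(random.choice([-1.0, 1.0]), 0.1) for _ in range(10_000)]

# Regression view: the best constant under squared error is the mean,
# which falls between the two modes where almost no data lives.
mean_prediction = sum(samples) / len(samples)

# Classification view: five fixed-width bins with centers -1, -0.5, 0, 0.5, 1.
N_BINS, LOW, HIGH = 5, -1.25, 1.25
WIDTH = (HIGH - LOW) / N_BINS


def to_bin(x):
    """Return the bin index for x, clamping values outside [LOW, HIGH)."""
    idx = int((x - LOW) / WIDTH)
    return min(max(idx, 0), N_BINS - 1)


counts = Counter(to_bin(x) for x in samples)
probs = [counts[b] / len(samples) for b in range(N_BINS)]

print(f"mean prediction: {mean_prediction:+.3f}")
for b, p in enumerate(probs):
    center = LOW + (b + 0.5) * WIDTH
    print(f"bin centered at {center:+.2f}: {p:.3f}")
```

The empirical bin frequencies stand in for what a classifier trained with cross-entropy would approximate; sampling from them returns values near either mode rather than near zero.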
Isn't it a given that classification would "outperform" regression, assuming n_classes < n_possible_continuous_labels? Turning a regression problem into a classification problem bins the data, offers more examples per label, simplifying the problem, with a tradeoff in what granularity you can predict.

(It depends on what you mean by "outperform" since metrics for classification and regression aren't always comparable, but I think I'm following the meaning of your comment overall)

Tokenisation turns a continuous signal into a normalized discrete vocabulary: stock "went up a lot", "went up a little", "stayed flat". This smooths out noise and simplifies matching up similar but not identical signals.
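A rough sketch of that idea, with made-up thresholds and token names (nothing here comes from a specific library or paper). Each step-to-step change is divided by the series' own spread, then mapped to one of five labels. I scale but don't center, so that "flat" keeps meaning "no change".

```python
import statistics


def tokenize_changes(prices, small=0.5, large=1.5):
    """Map each change to a coarse label; `small` and `large` are assumed cutoffs."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    # Scale by the spread of the changes; fall back to 1.0 for a constant series.
    scale = statistics.pstdev(changes) or 1.0
    tokens = []
    for change in changes:
        z = change / scale
        if z >= large:
            tokens.append("up_a_lot")
        elif z >= small:
            tokens.append("up_a_little")
        elif z > -small:
            tokens.append("flat")
        elif z > -large:
            tokens.append("down_a_little")
        else:
            tokens.append("down_a_lot")
    return tokens


# Two hypothetical series with the same shape at very different price levels.
series_a = [100, 101, 101, 105, 104]
series_b = [10.0, 10.1, 10.1, 10.5, 10.4]

tokens_a = tokenize_changes(series_a)
tokens_b = tokenize_changes(series_b)
print(tokens_a)
print(tokens_a == tokens_b)
```

Both series end up as the same token sequence, which is the "matching up similar but not identical signals" part.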

> We tokenize text because text isn't numbers.

Text is actually numbers. People have tried feeding raw UTF-8 bytes directly into transformers, but it doesn't work that well. Karpathy explains why:

https://www.youtube.com/watch?v=zduSFxRajkE

> Text is actually numbers

Text can be represented by numbers but they aren't the same datatype. They don't support the same operations (addition, subtraction, multiplication, etc).

Interesting. Can you explain how this is superior and/or different from traditional DSP filters or other non-tokenization tricks in the signal processing field?
Traditional DSP filters still output a continuous signal. And it's a well-explored domain, so it's hard to imagine any low-hanging fruit there.

My intuition is the following: transformers work really well for text, so we could try turning a time series into a "story" (limited vocabulary) and see what happens.

Like this or something different?

https://github.com/gzerveas/mvts_transformer

I think it could also have a connection with symbolic AI: the discrete tokens could be the symbols that many believe are useful or necessary for reasoning. Tokenization also helps with compression: quantized values can be stored as small integers, which reduces memory requirements.
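On the memory point, a back-of-the-envelope sketch. It assumes the values are already scaled into [0, 1) and uses 256 levels so each token fits in one byte; both choices are illustrative.

```python
import random
from array import array

random.seed(0)

# Hypothetical series already scaled into [0, 1), stored as 8-byte floats.
values = array("d", (random.random() for _ in range(1000)))

# Quantize to 256 levels so each token fits in one unsigned byte.
LEVELS = 256
tokens = array("B", (min(int(v * LEVELS), LEVELS - 1) for v in values))

# Dequantize to bin centers to see how much precision was given up.
restored = [(t + 0.5) / LEVELS for t in tokens]
max_error = max(abs(v - r) for v, r in zip(values, restored))

raw_bytes = len(values) * values.itemsize
token_bytes = len(tokens) * tokens.itemsize
print(raw_bytes, token_bytes, max_error)
```

That is an 8x reduction in storage, at the cost of an error of at most half a bin width (1/512 here).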

https://en.wikipedia.org/wiki/Neuro-symbolic_AI

My primitive understanding is that we approximate a Markovian approach and indirectly model the transition probabilities just by working through tokens.
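To illustrate what I mean, here is the explicit first-order version: once the series is a token sequence, transition probabilities can simply be counted. The token names and the sequence are made up. A transformer conditions on a much longer context than one previous token, so this is only the simplest case of the idea.

```python
from collections import Counter, defaultdict


def transition_probabilities(tokens):
    """Estimate P(next token | current token) from bigram counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


# Hypothetical token sequence from an already-quantized series.
tokens = ["up", "up", "flat", "up", "down", "flat", "up", "up"]
probs = transition_probabilities(tokens)
for current, following in probs.items():
    print(current, following)
```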
My guess is that it enforces a kind of sparsity constraint.