| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by trott 623 days ago

> This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.

It's not just a matter of organizing things differently. Suppose your network dimension and sequence length are both X.

Then your memory usage (per layer) will be O(X^2), while your training update cost will be O(X^3). That's for both Transformers and RNNs.

However, at the end of the sequence, a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers.

2 comments

f_devd 623 days ago

Transformers actually have an quantifiable state size (see https://hazyresearch.stanford.edu/static/posts/2024-06-22-ac...) although it's anywhere between 200k and 2M floats (for 360M and 1.33B respectively iinm). So a sufficiently sized RNN could have the same state capacity as a transformer.

(this is from the Based paper: https://arxiv.org/pdf/2402.18668)

link

trott 623 days ago

> Transformers actually have an quantifiable state size

Are you griping about my writing O(X^2) above instead of precisely 2X^2, like this paper? The latter implies the former.

> So a sufficiently sized RNN could have the same state capacity as a transformer.

Does this contradict anything I've said? If you increase the size of the RNN, while keeping the Transformer fixed, you can match their recurrent state sizes (if you don't run out of RAM or funding)

link

f_devd 623 days ago

I was responding to

> a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers

The thing is RNN can look back infinitely if you don't exceed the state capacity. For transformers the state it is defined semi-implicitly (you can change the hidden dims but you cannot extend the look back; ignoring transformer-xl et al.) defined by the amount of tokens, for an RNN it's defined explicitly by the state size.

The big-O here is irrelevant for the architectures since it's all in the configuration & implementation of the model; i.e. there is no relevant asymptote to compare.

As an aside this was what was shown in the based paper, the fact that you can have a continuity of state (as with RNN) while have the same associative recall capability as a transformer (the main downfall of recurrent methods at that point).

link

trott 623 days ago

> The big-O here is irrelevant for the architectures since it's all in the configuration & implementation of the model; i.e. there is no relevant asymptote to compare.

NNs are like any other algorithm in this regard. Heck, look at the bottom of page 2 of the Were RNNs All We Needed paper. It has big-O notation there and elsewhere.

> I was responding to

>> a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers

In the BASED paper, in Eq. 10, sizeof(s) = 2dN. But I defined d = N = X above. Ergo, sizeof(s) = 2X^2 = O(X^2).

For minGRU, sizeof(s) = d. Ergo, sizeof(s) = X = O(X).

link

f_devd 622 days ago

That's just the state calculation which would be O(N) and O(1) respectively. The based paper is saying if you made Transformers recurrent you would have a state size of 2Nd -> O(N), while based has a state size of d*d' -> O(1).

Transformers do have O(N^2) time & memory complexity, and Based/RNN/SSM {O(N) time, O(1) mem}, with respect to sequence length if that's what you mean. The point is it doesn't really give an indication of quality.

We can choose our constant arbitrarily so the big-O you've stated only indicates memory/time-complexity not 'look-back' ability relevant to any task. If you input the entire sequence N times into an RNN, you also have perfect recall with O(N^2) but it's not exactly an efficient use of our resources.

Ideally our state memory is maximally utilized, this is the case for RNNs in the limit (although likely oversubscribed) but is not the case for transformers. The holy grail is to have an input-dependent state-size, however that is quite difficult.

link

slashdave 623 days ago

Simplistic thinking. An RNN hidden parameter space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize just how huge R^N can be.

link

trott 623 days ago

> Simplistic thinking. An RNN hidden parameter space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize just how huge R^N can be.

16N bits as hard limit, but more realistically, about 2N bits or less of useful information probably.

You'd need to grow the network dimension in proportion to the maximum sequence length just to avoid the information theoretical limit.

link