Hacker News new | ask | show | jobs
by janalsncm 1057 days ago
> Transformers can read the entire sequence at once and learn to “pay attention to” only the values that came earlier in time (via “masking”)

Unless the text fits into the model’s context window, this is incorrect. The self-attention layers will train via a sliding window over the text.

Learning to attend only to previous tokens is also not correct. There are a lot of ways to train a transformer. BERT is bidirectional for example.

1 comments

I mention that above this sentence in the image - glad someone was "paying attention to" the text ;)