by ACCount37 249 days ago
Not exactly "vanilla Transformer", but rather "a Transformer-like architecture with recurrence".

Which is still a fun idea to play around with - this approach clearly has its strengths. But it doesn't appear to be an actual "better Transformer". I don't think it deserves nearly as much hype as it gets.

1 comment

Right. There should really be a vanilla Transformer baseline.

As for recurrence, the idea has been around for a while: https://arxiv.org/abs/1807.03819

There are reasons it hasn't really been picked up at scale, even though the method tends to do well on synthetic tasks.
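
To make "Transformer-like architecture with recurrence" concrete, here is a minimal NumPy sketch of the difference from a vanilla stack. It is not the architecture from the paper under discussion or from the linked one, just a toy under loud assumptions: a single-head attention block with a residual connection, no layer norm, no feed-forward sublayer, no halting mechanism. All names (`init_block`, `block`, `stacked`, `recurrent`, `n_params`) are made up for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_block(d_model, rng):
    # Hypothetical toy block: four square projection matrices, nothing else.
    scale = 1.0 / np.sqrt(d_model)
    names = ("wq", "wk", "wv", "wo")
    return {n: rng.normal(0.0, scale, size=(d_model, d_model)) for n in names}


def block(x, p):
    # x has shape (seq_len, d_model). Single-head self-attention plus residual.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return x + (attn @ v) @ p["wo"]


def stacked(x, layers):
    # Vanilla-style depth: every layer has its own weights.
    for p in layers:
        x = block(x, p)
    return x


def recurrent(x, p, steps):
    # Recurrence over depth: the same weights are applied `steps` times.
    for _ in range(steps):
        x = block(x, p)
    return x


def n_params(layers):
    return sum(w.size for p in layers for w in p.values())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))
    layers = [init_block(8, rng) for _ in range(4)]
    shared = init_block(8, rng)
    print(stacked(x, layers).shape, n_params(layers))
    print(recurrent(x, shared, steps=4).shape, n_params([shared]))
```

Both paths do the same amount of compute per token at depth 4, but the recurrent one has a quarter of the parameters, and `steps` can be changed at inference time without adding weights. That is also why a vanilla baseline matters: the two are not the same model at equal depth.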