by lukemerrick 752 days ago
Just skimmed so far and didn't see any reference to the Simplified Transformer block of https://arxiv.org/abs/2311.01906 (and it seems they also left out grouped query attention, as pointed out by another comment).

While lazy me wants them to explain how their approach compares to these alternatives, it looks like their exposition is pretty clear (quite nice for a preprint!) and I guess I'll just have to actually read the paper for real to see for myself.

Given how well I've seen Simplified Transformer blocks work in my own playground experiments, I would not at all be surprised if other related tweaks work out well even on larger scale models. I wish some of the other commenters here had a bit more curiosity and/or empathy for these two authors who did a fine job coming up with and initially testing out some worthwhile ideas.