Hacker News new | ask | show | jobs
by dartos 812 days ago
I mean RWKV seems promising and isn’t a transformer model.

Transformers have first mover advantage. They were the first models that scaled to large parameter counts.

That doesn’t mean they’re the best or that they’ve won, just that they were the first to get big (literally and metaphorically)

2 comments

It doesn't seem promising, a one man band has been doing a quixotic quest based on intuition and it's gotten ~nowhere, and it's not for lack of interest in alternatives. There's never been a better time to have a different approach - is your metric "times I've seen it on HN with a convincing argument for it being promising?" -- I'm not embarrassed to admit that is/was mine, but alternatively, you're aware of recent breakthroughs I haven't seen.
RWKV has shown that you can scale RNNs to large parameter counts.

The fact that one person (initially) was able to do it highlights how much low hanging fruit there is for non transformers.

Also, the fact that a small number of people designed, trained, and published 5 versions of a perfectly serviceable (as in has decent summarizing ability. The biggest LLM use case) model which doesn’t have the time complexity of transformers is a big deal.

Yeah, I'd argue that transformers created such capital saturation that there's a ton of opportunity for alternative approaches to emerge.
Speak of the devil. Jamba just hit the front page.