by kouteiheika 806 days ago
The downside is that it's bad (like, really bad) on a certain subset of tasks. I once trained an RWKVv4 model on a machine translation task, and no matter how much I scaled it up it just didn't work at all, while an equivalent transformer did the job without a problem.

Intuitively this does make sense: for every token it outputs, a transformer can "look back" (through its attention mechanism) at the source sentence and at what it has previously generated, while an RNN like RWKV has to compress all of this into its internal state, which is both lossy and limited in size.
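
To make that intuition concrete, here's a minimal toy sketch in plain Python. It is not RWKV's actual update rule: `recurrent_state` is just a hypothetical decayed running sum standing in for "a fixed-size state", and the one-hot tokens, `decay` value, and key scaling are assumptions picked for illustration. The point is only that attention keeps every token around and can read any of them back almost exactly, while the fixed-size state blurs older tokens.

```python
import math

def attention_readout(query, keys, values):
    """Softmax attention over every stored (key, value) pair."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def recurrent_state(values, decay=0.9):
    """Fold the whole sequence into one fixed-size vector.

    Toy decayed sum (an assumption for illustration), not RWKV's real update.
    """
    state = [0.0] * len(values[0])
    for v in values:
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, v)]
    return state

n = 8
# Token i is the one-hot vector e_i; keys are scaled so the softmax is sharp.
values = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
keys = [[10.0 * x for x in v] for v in values]

# Attention: ask for the very first token, long after it was seen.
readout = attention_readout(values[0], keys, values)

# Recurrent state: one vector of size n, no matter how long the sequence is.
state = recurrent_state(values)

print("attention weight on token 0:", readout[0])
print("state weight on token 0 vs last token:", state[0], state[-1])
```

Attention still recovers the first token almost perfectly, at the cost of storing all `n` tokens. The state has constant size, but the first token's contribution has already faded well below the most recent one's.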

I haven't looked at the newer versions of RWKV (apparently we're at v6 now), but hopefully they perform better. In the end I think a hybrid architecture probably makes the most sense: have some sort of attention mechanism for the near context and an RNN-like state for the far context, which would give you the best of both worlds.
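
A rough sketch of what I mean by "hybrid", in plain Python. Everything here is hypothetical (the `HybridMemory` class, the `window` size, and the decayed-sum state are all made up for illustration, and this isn't any existing model's design): the last `window` tokens are kept exactly so attention could look at them, and anything older gets folded into a single fixed-size state.

```python
from collections import deque

class HybridMemory:
    """Exact window of recent tokens plus a lossy summary of older ones."""

    def __init__(self, window, dim, decay=0.9):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.recent = deque(maxlen=window)  # near context, kept verbatim
        self.state = [0.0] * dim            # far context, fixed size
        self.decay = decay

    def push(self, vec):
        # When the window is full, fold the oldest token into the state
        # before the deque drops it.
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]
            self.state = [self.decay * s + (1.0 - self.decay) * x
                          for s, x in zip(self.state, evicted)]
        self.recent.append(vec)

    def context(self):
        # What an attention layer would get to look at:
        # the exact recent tokens plus one summary slot.
        return list(self.recent) + [self.state]

dim = 10
mem = HybridMemory(window=4, dim=dim)
tokens = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
for tok in tokens:
    mem.push(tok)

print("context slots:", len(mem.context()))
print("summary of old tokens:", mem.state)
```

The context size stays at `window + 1` slots regardless of sequence length, so you keep the cheap inference of an RNN while the most recent tokens are still available without any loss.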

1 comments

What about multiple passes over the data? Make it recurse.
I also tried that - try to get it to iteratively "refine" its translation. I don't remember all of the details at this point, but in general it didn't help much. (Although maybe I just did it suboptimally and there might have been a better way to do it.)

I'm guessing scaling the model up massively would probably make it work in one shot (so that whatever it was translating would fit into its state), but I didn't really have the compute to try that.

LSTM?