Hacker News
by jballanc 44 days ago
I've been working on RVW, my adaptation of the standard transformer model that is capable of online continual learning without catastrophic forgetting. I finally published the first pre-print of my early experiments: https://doi.org/10.5281/zenodo.20064617

Now I'm working on scaling the work up to more parameters and improving performance. I just finished an extremely harsh test of a Nemotron-flavored RVW: stretches of a random assortment of domains interspersed with long runs of single domains. Across all of it, the model didn't forget (and actually improved on some of the more challenging domains). Perplexity (PPL) on SmolTalk is still around 18, which I'd like to get lower, but this is all with only 4B params.
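For readers trying to picture that kind of test, here is a rough sketch of what such a schedule could look like, plus the usual perplexity formula. It is not the actual harness from the pre-print: the domain names, block lengths, round-robin choice of the single-domain run, and the function names `build_schedule` and `perplexity` are all assumptions made up for illustration.

```python
import math
import random

def build_schedule(domains, mixed_len, single_len, rounds, seed=0):
    """Alternate mixed stretches with long single-domain runs (hypothetical layout)."""
    rng = random.Random(seed)
    schedule = []
    for i in range(rounds):
        # Mixed stretch: every step draws a domain uniformly at random.
        schedule.extend(rng.choice(domains) for _ in range(mixed_len))
        # Long run: hold one domain fixed, cycling through domains across rounds.
        schedule.extend([domains[i % len(domains)]] * single_len)
    return schedule

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Placeholder domain names; the post does not say which domains were used.
schedule = build_schedule(["code", "math", "chat"], mixed_len=4, single_len=10, rounds=3)
```

Forgetting would show up as per-domain perplexity rising on a domain after a long run of some other domain; a mean per-token loss of ln(18), about 2.89 nats, corresponds to the PPL of about 18 mentioned above.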

Currently, I'm training a Llama 3.2-flavored RVW with only about 2B params to see how that turns out. Depending on results of that, I may take it to Gemma 4 next.

1 comment

Super interesting. I'm also super into the idea of always-online continual learning.

I'll check it out. Thanks for sharing.