by lpasselin 843 days ago
The Mamba paper shows significant improvements at all model sizes, up to 1B, the largest one tested.

Is there any reason why it wouldn't scale to 7B or more? Have they tried it?

1 comment

That's the issue - I keep hearing that it is beyond a small research group's budget to meaningfully train such a large model. You don't just need GPU time, you also need data. And just using the dregs of the internet doesn't cut it.