Hacker News
by 5kg 845 days ago
I'd love to see someone with the resources train a model bigger than 2.8B and show that the scaling law still holds.
1 comment

Some earlier comments said those architectures lack a transformer’s memory, or something along those lines, and that this weakness is what keeps people using transformers. If that’s true, I’d also like to see tests across various domains with equivalent transformer and Mamba designs, to see whether that difference affects anything. From there, we’d have a better idea of whether a Mamba-176B is worth the money.