by kla-s
611 days ago
Please correct me if I’m wrong, but my understanding of ML/LLMs is that this kind of hand-crafting has been tried, and that it is easier and less finicky to let behavior like this emerge from more data; see [1] “Bitter Lesson” and [2] “Scaling Laws”.

Mamba as an architecture claims some significant performance gains, but to my knowledge no really large model (>~100B params) with a Mamba architecture has been disclosed, whether through open weights or a leak, other than this one (7B).

As other comments mention, another dimension not to forget is training data quality. What we are learning more and more with LLMs is that not only quantity but also quality really matters.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
[2] see e.g. https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZyB... for a well-made, easily digestible intro
https://arxiv.org/abs/2408.12570