Hacker News new | ask | show | jobs
by kla-s 611 days ago
Please someone correct me if I’m wrong, but my understanding of ML/LLMs is that this kind of hand crafting has been tried, but it is easier to train/less finicky to let behavior like this emerge from more data, see [1] “Bitter Lesson” and [2] “Scaling Laws”.

MAMBA as an architecture claims to have some significant gains performance wise, but to my knowledge there haven't been any really large models (>~100B params) with open weights/leaked MAMBA architecture been disclosed other than this (7B).

As mentioned by other comments, another dimension not to forget is the training data quality. Not only quantity but also quality really matters, is what we are learning more and more with LLMs..

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html [2] see eg https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZyB... for a well made/easily digestable intro

1 comments

Jamba 1.5 Large is 398B params (94B active) and weights are available.

https://arxiv.org/abs/2408.12570

Thanks for the link. The benchmark results aren't too impressive for its size but it likely hasn't been trained as thoroughly as llama (I couldn't find the training size in the paper but I doubt they have access to as much compute as Meta) so it still feels encouraging that it doesn't look ridiculous either.
Not as much as meta, no. But AI21 labs is partnered with Amazon and did a ~$200M funding round last year IIRC so still plenty of funds for training big models
Thanks, missed that one.

For context gpt-4 is supposedly @ 1.8T params.