by refibrillator
823 days ago
The biggest model is a 30B MoE trained on 100B tokens, with a max sequence length of 4096. A bit underwhelming compared to recent announcements like the open-source Large World Model [1]. There are absolutely no benchmarks against GPT-4 in the paper.

Notably, they used instruction-response pairs generated from GPT-4 for supervised fine-tuning, which has always felt like an experimental hack to me, but that’s how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.

Apple’s axlearn framework was used, which leverages JAX and XLA [2].

[1] https://news.ycombinator.com/item?id=39367141

[2] https://github.com/apple/axlearn