by refibrillator 823 days ago
The biggest model is a 30B MoE trained on 100B tokens, with a max sequence length of 4096. A bit underwhelming compared to recent announcements like the open-source Large World Model [1].

Absolutely no benchmarks against GPT4 present in the paper.

Notably, they used instruction-response pairs generated from GPT4 for supervised fine-tuning. That has always felt like an experimental hack to me, but that’s how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.

They used Apple’s axlearn framework, which leverages JAX and XLA [2].

[1] https://news.ycombinator.com/item?id=39367141

[2] https://github.com/apple/axlearn

2 comments

You seem to be missing what this submission is about. It's not an Apple press release about a competing model; it's a research paper that discusses different tradeoffs in architecture and data and how each part affects the results of the trained model. In an era where training a large model can be cost-prohibitive, this insight is key: it tells you where to optimize and where to cut corners to get the most bang for your buck.

> Absolutely no benchmarks against GPT4 present in the paper.

Table 4 on page 14 shows comparisons to GPT4V.