|
|
|
|
|
by jbarrow
72 days ago
|
|
Very cool to see a company pushing what's possible with (relatively) tiny models! A 350M parameter trained on 28T tokens that, from the benchmarks, is competitive with Qwen3.5-0.8B. Comparing the architecture to Qwen3.5, it seems: - fewer, wider layers - mixing full attention and conv's, instead of the full+linear attention of Qwen3.5 - the vocab is about 1/4 the size |
|