by nycdatasci 793 days ago
It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."

https://www.databricks.com/blog/introducing-dbrx-new-state-a...
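For anyone wondering where the 65x in the quote comes from, it appears to be a count of unordered expert subsets per token: C(16, 4) for DBRX versus C(8, 2) for Mixtral and Grok-1. A quick sanity check, assuming that is the counting Databricks meant (the helper name `expert_subsets` is mine, not from the blog post):

```python
from math import comb


def expert_subsets(num_experts: int, top_k: int) -> int:
    """Count the distinct unordered sets of top_k experts out of num_experts.

    Assumption: the router picks top_k distinct experts and order does not
    matter, so this is a plain binomial coefficient.
    """
    return comb(num_experts, top_k)


dbrx = expert_subsets(16, 4)     # 16 experts, choose 4
mixtral = expert_subsets(8, 2)   # 8 experts, choose 2
ratio = dbrx // mixtral          # divides evenly for these values

print(dbrx, mixtral, ratio)  # 1820 28 65
```

This only counts which subsets are selectable. It says nothing about how evenly the router actually uses them in practice.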