|
|
|
|
|
by moffkalast
929 days ago
|
|
It means it's 8 7B models in a trench coat in a sense, it runs as fast as a 14B (2 experts at a time apparently) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B. |
|
Do we actually either no that it is MoE or that size? IIRC both if those started as outsidr guesses that somehow just became accepted knowledge without any actual confirmation.