| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nulld3v 930 days ago

Looks to be Mixture of Experts, here is the params.json:

    {
        "dim": 4096,
        "n_layers": 32,
        "head_dim": 128,
        "hidden_dim": 14336,
        "n_heads": 32,
        "n_kv_heads": 8,
        "norm_eps": 1e-05,
        "vocab_size": 32000,
        "moe": {
            "num_experts_per_tok": 2,
            "num_experts": 8
        }
    }

2 comments

sockaddr 930 days ago

What does expert mean in this context?

link

moffkalast 930 days ago

It means it's 8 7B models in a trench coat in a sense, it runs as fast as a 14B (2 experts at a time apparently) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B.

link

dragonwriter 930 days ago

> GPT 4 is based on the same architecture, but at 8*222B.

Do we actually either no that it is MoE or that size? IIRC both if those started as outsidr guesses that somehow just became accepted knowledge without any actual confirmation.

link

moffkalast 930 days ago

Iirc some of the other things the same source stated were later confirmed, so this is likely to be true as well, but I might be misremembering.

link

rishabhjain1198 930 days ago

In a MoE model with experts_per_token = 2 and each expert having 7B params, after picking the experts it should run as fast as the slowest 7B expert, not a comparable 14B model.

link

nullc 930 days ago

Only assuming it's able to hide the faster one in free parallelism.

link

moffkalast 929 days ago

My CPU trying its best to run inference: parallelwhat?

link

tavavex 930 days ago

Does anyone here know roughly how an expert gets chosen? It seems like a very open-ended problem, and I'm not sure on how it can be implemented easily.

link

rishabhjain1198 930 days ago

[Relevant paper](https://arxiv.org/abs/1701.06538).

TL;DR you can think of it as the initial part of the model is essentially dedicated to learning which experts to choose.

link

WeMoveOn 930 days ago

How did you come up with 40b for the memory? specifically, why 0.7 * total params?

link

moffkalast 929 days ago

It's just a rough estimate given that these things are fairly linear, the original 7B mistral was 15 GB and the new one is 86 GB, whereas a fully duplicated 8 * 15 GB would suggest a 120 GB size, so 86/120 = 0.71 for actual size, suggesting 29% memory savings. This of course doesn't really account for any multiple vs single file saving overhead and such, so it's likely to be a bit off.

link

sockaddr 930 days ago

Fascinating. Thanks

link

sp332 930 days ago

I don't see any code in there. What runtime could load these weights?

link

brucethemoose2 930 days ago

Its presumably llama just like Mistral.

Everything open source is llama now. Facebook all but standardized the architecture.

I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.

link