by claytonius
1887 days ago
I don’t think it’s straightforward to do a head-to-head comparison. From https://www.youtube.com/watch?v=yso2S2Svdlg @ 25:14:

James Wang: "If a model doesn’t fit into a GPU’s HBM, is it smaller when it’s laid out in the Cerebras way relative to your 18 gigabytes?"

Andrew Feldman: "It is, it’s smaller in that we hold different things in memory than they do. One can imagine a model that has more parameters than we can hold, one can posit one, but remember our memory is doing different things. Our memory is basically holding parameters. That’s not what their memory is doing. Their memory is holding the shape of the model, their model is holding the results of the batches. We use memory rather differently. We haven’t found models that we can’t place and train on a chip. We expect them to emerge, that’s why we support clustering of chips and systems, that’s why we do that in what’s called a “model parallel” way, where if you put two chips together you get twice the memory capacity. That’s not what you get when you put multiple GPUs together. When you put multiple GPUs together you get two versions of the same amount of memory, you actually don’t get twice the memory. I see you smiling here because you know that’s a problem… …With us, if we support 4 billion parameters and you add a second wafer scale engine, now you support 8 billion parameters, and if you add a third you can support 12 billion. That’s not the way it works with GPUs. With GPUs you just support two chips, each with a few million to tens of millions of parameters."
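To make the arithmetic in that quote concrete, here is a rough Python sketch of the two scaling regimes Feldman contrasts, taking his framing at face value. The function names are hypothetical and the model is idealized: it counts only parameters and ignores activations, optimizer state, and interconnect limits. The 4 billion figure is the one from the quote.

```python
def model_parallel_capacity(params_per_chip: int, num_chips: int) -> int:
    """Parameters are sharded across chips, so capacity grows linearly."""
    if num_chips < 1:
        raise ValueError("num_chips must be at least 1")
    return params_per_chip * num_chips


def data_parallel_capacity(params_per_chip: int, num_chips: int) -> int:
    """Every chip holds a full replica, so the largest model that fits
    is still bounded by a single chip's memory."""
    if num_chips < 1:
        raise ValueError("num_chips must be at least 1")
    return params_per_chip


# Per-wafer figure quoted by Feldman (4 billion parameters).
PARAMS_PER_WAFER = 4_000_000_000

for chips in (1, 2, 3):
    sharded = model_parallel_capacity(PARAMS_PER_WAFER, chips)
    replicated = data_parallel_capacity(PARAMS_PER_WAFER, chips)
    print(f"{chips} chip(s): model-parallel {sharded:,} vs replicated {replicated:,}")
```

With three chips the sharded case reaches the 12 billion parameters mentioned in the quote, while the replicated case stays at one chip's worth.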
I'm curious how the problem effectively gets sliced.