by filterfiber 916 days ago
I don't understand why they're comparing the parameter sizes to lines of code.

AFAIK you can just increase the layer parameters of a 1B model to whatever you want? Like, the difference between a 1B and 175B model can be just changing a few numbers, and not adding any LOC at all?
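To make the "just changing a few numbers" point concrete, here is a rough sketch of how parameter count follows from a handful of config values. The `GPTConfig` class and `approx_params` helper are hypothetical names for illustration, the 12 * n_layer * d_model^2 rule is a common approximation that ignores biases and layer norms, and the two configs are assumed from the published GPT-3 1.3B and 175B shapes, not from the article.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Hypothetical config holder; field names are illustrative only.
    n_layer: int
    d_model: int
    vocab_size: int = 50257  # assumed GPT-2/GPT-3 BPE vocabulary size
    n_ctx: int = 2048        # assumed context length


def approx_params(cfg: GPTConfig) -> int:
    # Per block: ~4*d^2 for attention plus ~8*d^2 for a 4x-wide MLP.
    blocks = 12 * cfg.n_layer * cfg.d_model ** 2
    # Token and learned position embeddings.
    embeddings = (cfg.vocab_size + cfg.n_ctx) * cfg.d_model
    return blocks + embeddings


small = GPTConfig(n_layer=24, d_model=2048)    # roughly 1.3B parameters
large = GPTConfig(n_layer=96, d_model=12288)   # roughly 175B parameters

print(f"small: {approx_params(small) / 1e9:.1f}B")
print(f"large: {approx_params(large) / 1e9:.1f}B")
```

The model definition itself is identical in both cases; only two integers differ. Whether the result can actually be trained is a separate question.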

LOC has never been a limitation for large models, it's been the compute+training data required.

Most of the LOC is spent on optimization, and they don't address MoE or anything fancy like that?

1 comments

When you go from 1B to 175B, the model no longer fits in memory, so in practice you have to refactor the model using tensor/pipeline parallelism. That's why it goes from 600 to 20K LOC.
It doesn't look like Cerebras mentioned the most important part: by trading away model complexity for a vastly more capable system, they could refactor that 600-line model effortlessly and rerun.
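A back-of-the-envelope sketch of the memory argument, under stated assumptions: about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance), activations ignored, and an 80 GB accelerator as the device size. The function names are made up for illustration.

```python
def training_bytes(n_params: int, bytes_per_param: int = 16) -> int:
    # Weights + gradients + optimizer state; activations are NOT included.
    return n_params * bytes_per_param


def min_devices(n_params: int, device_gb: int = 80, bytes_per_param: int = 16) -> int:
    # Smallest number of devices whose combined memory holds the training state.
    device_bytes = device_gb * 10**9
    total = training_bytes(n_params, bytes_per_param)
    return -(-total // device_bytes)  # integer ceiling division


for name, n in [("1B", 10**9), ("175B", 175 * 10**9)]:
    gb = training_bytes(n) / 1e9
    print(f"{name}: ~{gb:,.0f} GB of training state, at least {min_devices(n)} device(s)")
```

Under these assumptions the 1B model fits on one device, while the 175B model needs its state split across dozens, which is where the tensor/pipeline parallelism code comes from.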

They can watch different layers train and find out how to optimize training or quantization, etc.

It feels like they kinda missed the forest for the trees here. The article should have focused on the model architecture optimization that the small LoC and the system's ridiculous training capacity make possible.