by xg15
309 days ago
Wouldn't this imply that most of the inference-time storage and compute might be unnecessary? If the hypothesis is true, it makes sense to scale up models as much as possible during training. But once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight", because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?
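
To make the question concrete, here is a rough sketch of what dropping the "dead weight" after training could look like. It assumes plain magnitude pruning (keep the largest weights by absolute value, zero the rest), which is only a stand-in for however the winning subnetwork would actually be identified. The function names, the toy weight list and the `keep_fraction` value are all made up for illustration.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights.

    weights: flat list of floats (hypothetical trained layer weights).
    keep_fraction: share of weights to keep, between 0 and 1.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    n_keep = round(len(weights) * keep_fraction)
    # Indices ordered by descending absolute value; the first n_keep survive.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy example: assumed values, not taken from any real model.
weights = [0.9, -0.01, 0.02, -1.3, 0.003, 0.0004, 0.05, -0.7]
pruned = magnitude_prune(weights, keep_fraction=0.25)
print(pruned)
print(sparsity(pruned))
```

Note that zeroing weights like this does not save anything by itself: the storage and compute only go away if the zeros are stored in a sparse format or whole units are removed, and whether accuracy survives at a given `keep_fraction` would have to be measured.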