by ersiees
1162 days ago
Very interesting that someone is finally trying out muP in the real world. Do I understand the usage correctly: muP is only used to get around choosing an lr for each model size? If so, I wonder how it compares to standard heuristics like the one in the OG scaling laws paper by OpenAI, and to tricks like rewinding a few steps after a loss explosion. Also, for some reason muP was not trusted for the largest training runs? Why is that?