by pbmonster
924 days ago
Do you have any more information on the topic? I remember reading about significant memory savings achieved by reusing most layers. That made sense to me at first sight, because you don't need to train stuff like syntax and grammar 8 times in 8 different ways. It would also explain why inference with two 7B models has the cost of running a 12B model.