by shekhar101
926 days ago
Andrej Karpathy's explanation of why makes sense:
'''
"8x7B" name is a bit misleading because it is not all 7B params that are being 8x'd, only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B.
'''
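To sanity-check that arithmetic, here is a rough sketch of the parameter count. The layer sizes below are my assumption, taken from the publicly released Mixtral 8x7B config rather than from the quote (hidden size 4096, 32 layers, FFN width 14336, 32 query heads and 8 KV heads of dim 128, vocab 32000, untied embeddings, a small linear router per layer). The function name `param_count` is just for illustration.

```python
# Assumed architecture constants (not stated in the quote above).
HIDDEN = 4096
LAYERS = 32
FFN_DIM = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000


def param_count(n_experts: int) -> int:
    """Total parameters when each layer holds `n_experts` FeedForward blocks."""
    # Attention: Q and O are HIDDEN x HIDDEN, K and V are smaller (grouped-query).
    attn = 2 * HIDDEN * (N_HEADS * HEAD_DIM) + 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)
    # One FeedForward block has three HIDDEN x FFN_DIM matrices (gated MLP).
    ffn = 3 * HIDDEN * FFN_DIM
    # The router only exists when there is more than one expert.
    router = HIDDEN * n_experts if n_experts > 1 else 0
    # Two norm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attn + n_experts * ffn + router + norms
    # Input embedding, output head, and the final norm are shared, never duplicated.
    shared = 2 * VOCAB * HIDDEN + HIDDEN
    return LAYERS * per_layer + shared


dense = param_count(1)  # a single-FFN model with the same shapes
moe = param_count(8)    # only the FeedForward blocks are 8x'd

print(f"dense:    {dense / 1e9:.2f}B")
print(f"8 x dense:{8 * dense / 1e9:.2f}B")
print(f"MoE:      {moe / 1e9:.2f}B")
```

Under these assumptions the MoE total lands at about 46.7B, well short of eight times the dense count, because attention, embeddings and norms are counted once.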