by dartos
564 days ago
I may be missing something, but I thought each context token added 3 more parameters for self-attention to build its map, since each attention computation must produce a value that takes all of the existing context into account.
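To make the question concrete, here is a minimal sketch, assuming standard single-head, causal, scaled dot-product attention with no biases. All names (`make_weights`, `self_attention`, `count_parameters`) are hypothetical and exist only for illustration. Under those assumptions, each token gets three computed vectors (query, key, value) and one more row in the attention map, while the weight matrices that produce them keep the same shape however long the context is.

```python
import math
import random


def make_weights(d_model, seed=0):
    """Three learned d_model x d_model matrices (hypothetical toy values)."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_model)]

    return {"W_q": mat(), "W_k": mat(), "W_v": mat()}


def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(tokens, weights):
    """Causal self-attention over a list of token embedding vectors."""
    d = len(tokens[0])
    # Three vectors are computed per token; these are activations, not weights.
    qs = [matvec(weights["W_q"], t) for t in tokens]
    ks = [matvec(weights["W_k"], t) for t in tokens]
    vs = [matvec(weights["W_v"], t) for t in tokens]

    outputs, attn_map = [], []
    for i, q in enumerate(qs):
        # Token i scores itself and every earlier token in the context.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks[: i + 1]]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        attn_map.append(probs)
        # Output is the attention-weighted sum of the visible value vectors.
        outputs.append(
            [sum(p * v[j] for p, v in zip(probs, vs[: i + 1])) for j in range(d)]
        )
    return outputs, attn_map


def count_parameters(weights):
    """Number of scalar weights across W_q, W_k and W_v."""
    return sum(len(W) * len(W[0]) for W in weights.values())
```

Running this with a 3-token and then a 5-token context gives the same `count_parameters` result; what grows is the number of query/key/value vectors and the size of the attention map.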