Hacker News new | ask | show | jobs
by dartos 564 days ago
I may be missing something, but I thought that each context token would result in an 3 additional parameters per context token for self attention to build its map, since each attention must calculate a value considering all existing context