Hacker News
by microtonal 312 days ago
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
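
For anyone wondering what "extra trained logits in the softmax" might look like, here is a rough sketch in plain Python. It assumes one learned scalar per head (`sink_logit`, a hypothetical name) that joins the softmax denominator but has no value vector, so whatever probability it takes is simply dropped from the output. The function names and the single-query shape are my own choices for illustration, not taken from any particular model's code.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Softmax over the real scores plus one extra logit (the sink).

    Returns (weights over the real keys, probability mass taken by the sink).
    The real-key weights sum to less than 1 whenever the sink takes mass.
    """
    m = max(max(scores), sink_logit)           # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    sink_exp = math.exp(sink_logit - m)
    denom = sum(exps) + sink_exp               # sink only enters the denominator
    return [e / denom for e in exps], sink_exp / denom

def attend(query, keys, values, sink_logit):
    """Single-query scaled dot-product attention with a sink logit."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights, sink_mass = softmax_with_sink(scores, sink_logit)
    dim = len(values[0])
    # The sink has no value vector, so it contributes nothing to the output.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights, sink_mass
```

The contrast with the prepended-token approach is that there is no extra key or value in the sequence: the head can "attend to nothing" by letting the sink absorb mass, without a dedicated token sitting in the context.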