Y
Hacker News
new
|
ask
|
show
|
jobs
by
microtonal
312 days ago
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).