by valine 389 days ago
That’s true, yeah. The model can do that because calculating latents is independent of next-token prediction. You run the forward pass over the tokens in your sequence and just skip the final projection to logits, so you end up with one latent per token and no vocabulary scores.
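To make that concrete, here is a rough sketch in plain Python. Everything in it is a toy I made up for illustration: `embed`, `w_mix`, `lm_head`, the sizes, and the causal-mean stand-in for attention are all assumptions, not any real model's architecture. The only point is that `latents()` never touches `lm_head`, and the logit projection is a separate step you can skip.

```python
import math
import random

VOCAB, DIM = 11, 4          # toy sizes, chosen arbitrarily
rng = random.Random(0)      # fixed seed so runs are repeatable

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embed = rand_matrix(VOCAB, DIM)     # token id -> embedding vector
w_mix = rand_matrix(DIM, DIM)       # weights of a toy "block"
lm_head = rand_matrix(DIM, VOCAB)   # final projection to logits

def matvec(vec, mat):
    # Row vector (len = rows of mat) times matrix -> vector of len cols.
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def latents(token_ids):
    """Return one hidden vector per token; lm_head is never used here."""
    out, running = [], [0.0] * DIM
    for n, tok in enumerate(token_ids, start=1):
        running = [r + e for r, e in zip(running, embed[tok])]
        context = [r / n for r in running]  # causal mean, stand-in for attention
        out.append([math.tanh(x) for x in matvec(context, w_mix)])
    return out

def logits(hidden):
    """Optional last step: project a single latent to vocabulary scores."""
    return matvec(hidden, lm_head)

tokens = [3, 1, 4, 1, 5]
h = latents(tokens)                  # latents for every position, no logits
next_token_scores = logits(h[-1])    # only needed if you want to predict
```

Because the toy is causal, the latents for a prefix don't change when you append more tokens, which is the same property that lets a real decoder reuse them.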