by wyager 20 days ago
This description falls apart for two reasons:

1. It only accurately describes pre-training.
2. It ignores the existence of generalization.

Next-token prediction is just a training task, not "what the model does internally" in any meaningful sense.