|
|
|
|
|
by earslap
816 days ago
|
|
The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and it takes the same amount of computation to generate the next token compared to the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute (attend to relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need so more relevant context is available for future tokens. |
|