by dist-epoch
114 days ago
Because that's where the compute happens: in those "verbose" tokens. A transformer has a fixed size, so it can only do so many math operations in one forward pass, and each generated token is one more pass. If your problem is hard, you need more passes. Asking it to be shorter is like running fewer iterations of a numerical integration algorithm.
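
To make the analogy concrete, here is a minimal sketch of the numerical side only. It says nothing about transformer internals; the trapezoidal rule, the integrand `sin(x)`, and the step counts are all my own picks for illustration. Same algorithm each time, just a different iteration budget:

```python
import math

def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule using `steps` equal subintervals."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))      # endpoints carry half weight
    for i in range(1, steps):        # interior points carry full weight
        total += f(a + i * h)
    return total * h

exact = 2.0  # the integral of sin(x) over [0, pi] is exactly 2

# Illustrative step counts: fewer iterations means less work and a worse answer.
errors = {
    n: abs(trapezoid(math.sin, 0.0, math.pi, n) - exact)
    for n in (2, 8, 32, 128)
}

for n, err in errors.items():
    print(f"{n:4d} steps: error = {err:.2e}")
```

The error shrinks as the step count grows. Capping the steps doesn't make the method smarter, it just stops it earlier.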