by vlovich123
91 days ago
Then you need to do a bit more research. No one applies sparse attention only at inference time to a model that wasn't trained with it. It's applied at training time, because otherwise task performance degrades too much.