by grandmczeb
3037 days ago
One of the core operations of the transformer network [1] is an (LxL) x (LxE) matrix multiply, where L is the sentence length and E is the network width. Can you be more specific about how you would get good performance without specializing on L?

[1] https://arxiv.org/abs/1706.03762
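To make the shapes concrete, here is a minimal pure-Python sketch of that multiply, assuming it refers to the attention-weights-times-values product (that reading, and the names `attention_matmul` and `matmul_flops`, are illustrative assumptions, not anything from the paper's code). It only shows that both operand shapes and the work done change with L; it says nothing about how a real kernel would be tuned.

```python
def attention_matmul(weights, values):
    """Multiply an (L x L) matrix by an (L x E) matrix, returning (L x E)."""
    L = len(weights)
    E = len(values[0])
    # Both operands depend on the sentence length L.
    assert all(len(row) == L for row in weights)
    assert len(values) == L
    out = [[0.0] * E for _ in range(L)]
    for i in range(L):
        for k in range(L):
            w = weights[i][k]
            for j in range(E):
                out[i][j] += w * values[k][j]
    return out


def matmul_flops(L, E):
    """Multiply-add count for the product: quadratic in L, linear in E."""
    return 2 * L * L * E


if __name__ == "__main__":
    E = 3  # hypothetical network width, kept tiny for readability
    for L in (2, 5):  # two hypothetical sentence lengths
        weights = [[1.0 / L] * L for _ in range(L)]   # uniform attention
        values = [[float(k)] * E for k in range(L)]   # row k holds the value k
        out = attention_matmul(weights, values)
        print(L, len(out), len(out[0]), matmul_flops(L, E))
```

Since L varies per sentence, a kernel compiled for one fixed (L, E) pair would not cover the next input, which is the crux of the question.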