by anthonypasq
8 days ago
> it is well optimized for fast inference

Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?

For now, this is what NVIDIA says: