by rejuvyesh
1521 days ago
I'm not related to the authors and don't have the same machine, but here is the tiny-cuda-nn performance on a V100 for the blog post's matrix power example:
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Initial Train Loss: 5.7188
Initial Test Loss: 5.2812
Took: 11.41 seconds
Train Loss: 0.0354
Test Loss: 0.0514
Took: 11.58 seconds
Train Loss: 0.0327
Test Loss: 0.0511
Took: 11.42 seconds
Train Loss: 0.0316
Test Loss: 0.0505
I think almost all of the time here is Python overhead, because increasing the batch size 10x still takes about the same time:
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Initial Train Loss: 5.5391
Initial Test Loss: 5.5938
Took: 11.03 seconds
Train Loss: 0.0444
Test Loss: 0.0545
Took: 11.16 seconds
Train Loss: 0.0388
Test Loss: 0.0496
Took: 11.01 seconds
Train Loss: 0.0384
Test Loss: 0.0490
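To put a number on that, here is a rough sketch that compares the two sets of timings. The times are copied from the two logs above; the helper name `scaling_ratio` is made up for illustration, and the interpretation assumes the number of training steps per timed round is the same in both runs. If the run were compute-bound, 10x the batch size should push the ratio toward 10; a ratio near 1 points to a fixed per-step cost instead.

```python
from statistics import mean

# Per-round wall-clock times in seconds, copied from the two logs above.
base_times = [11.41, 11.58, 11.42]       # original batch size
big_batch_times = [11.03, 11.16, 11.01]  # batch size increased 10x


def scaling_ratio(base, scaled):
    """Return mean(scaled) / mean(base) for two lists of timings.

    Hypothetical helper for illustration only. A value near 1 means the
    extra work did not show up in wall-clock time; a value near the batch
    size multiplier means time grew with the amount of data.
    """
    return mean(scaled) / mean(base)


ratio = scaling_ratio(base_times, big_batch_times)
print(f"mean time, base batch: {mean(base_times):.2f} s")
print(f"mean time, 10x batch:  {mean(big_batch_times):.2f} s")
print(f"time ratio for 10x the batch size: {ratio:.2f}")
```

The ratio comes out just under 1, so the per-sample cost is lost in the noise at these sizes. That fits a fixed per-step cost, but it doesn't by itself tell Python overhead apart from other fixed costs such as kernel launches.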
See [gist](https://gist.github.com/rejuvyesh/6c428ea12154edbb36cd4359fa...) for the implementation.