
by rejuvyesh 1569 days ago

I'm not affiliated with the authors and don't have the same machine, but here is tiny-cuda-nn's performance on a V100 for the blog post's matrix power example:

    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Initial Train Loss: 5.7188
    Initial Test Loss: 5.2812
    Took: 11.41 seconds
    Train Loss: 0.0354
    Test Loss: 0.0514
    Took: 11.58 seconds
    Train Loss: 0.0327
    Test Loss: 0.0511
    Took: 11.42 seconds
    Train Loss: 0.0316
    Test Loss: 0.0505

I think almost all of the time here is Python overhead, because if we increase the batch size 10x, it still takes the same time:

    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Initial Train Loss: 5.5391
    Initial Test Loss: 5.5938
    Took: 11.03 seconds
    Train Loss: 0.0444
    Test Loss: 0.0545
    Took: 11.16 seconds
    Train Loss: 0.0388
    Test Loss: 0.0496
    Took: 11.01 seconds
    Train Loss: 0.0384
    Test Loss: 0.0490

See [gist](https://gist.github.com/rejuvyesh/6c428ea12154edbb36cd4359fa...) for the implementation.
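
To put a number on the batch-size comparison, here is a rough sketch that only does arithmetic on the "Took" values from the two logs above. It assumes both runs execute the same number of training steps per reporting interval (the logs don't state this), and the names `base_times`, `large_times`, and `slowdown` are made up for illustration; they are not from the gist.

```python
from statistics import mean

# "Took" values copied from the two logs above, in seconds.
# Assumption: each interval covers the same number of steps in both runs.
base_times = [11.41, 11.58, 11.42]   # original batch size
large_times = [11.03, 11.16, 11.01]  # batch size increased 10x

def slowdown(base, scaled):
    """Ratio of mean interval time after scaling to mean interval time before."""
    return mean(scaled) / mean(base)

batch_scale = 10
ratio = slowdown(base_times, large_times)

# A run dominated by GPU compute would slow down by roughly batch_scale;
# a run dominated by fixed per-step overhead stays near 1x.
print(f"observed slowdown: {ratio:.2f}x for a {batch_scale}x larger batch")
```

With these numbers the ratio comes out at about 0.96x, nowhere near 10x, which is consistent with per-step overhead rather than GPU work dominating the wall-clock time. It doesn't show where that overhead comes from; a profiler would be needed for that.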