by ozinenko 3046 days ago
Section 7 of the paper (https://arxiv.org/abs/1802.04730) has a couple of examples.

In short, yes, cuDNN is fast for the cases it was tuned for. It is probably faster on power-of-two sizes, but when you operate on a 26 x 1024954 x 3 tensor, TC can generate code specialized for exactly that shape. Want 42 x 17 x 5? TC can generate differently specialized code, with almost no effort from the user (or performance engineers).
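
To make "specialized code" concrete, here is a toy sketch in plain Python. It is not TC's API or how TC works internally. The helper name `specialize_sum_last_axis` is made up for illustration. It only shows the general idea: generate source with the exact sizes baked in as constants, compile it once per shape, and reuse it.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def specialize_sum_last_axis(shape):
    """Hypothetical helper: build a sum-over-last-axis kernel for one exact 3-D shape.

    Assumes every dimension is at least 1. Not part of TC or any real library.
    """
    n, m, k = shape
    # The innermost reduction is fully unrolled, and the loop bounds are literals.
    terms = " + ".join(f"x[i][j][{c}]" for c in range(k))
    src = (
        "def kernel(x):\n"
        f"    return [[{terms} for j in range({m})] for i in range({n})]\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated source for this shape
    return namespace["kernel"]


# One kernel per shape; asking for the same shape again reuses the cached one.
kernel_a = specialize_sum_last_axis((42, 17, 5))
kernel_b = specialize_sum_last_axis((26, 4, 3))

x = [[[1] * 5 for _ in range(17)] for _ in range(42)]
out = kernel_a(x)
```

A real code generator would emit GPU code and search over tilings and mappings rather than unroll a Python expression, but the shape-keyed "generate once, reuse" pattern is the part this sketch is meant to show.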

Can a performance expert do a better job than the TC optimizer? Very likely yes, but it will very likely take much more time.

TC is not a framework. It can be integrated with any framework of your liking.