Hacker News new | ask | show | jobs
by npalli 180 days ago
Seems very detailed and comprehensive. Did I miss it, but was there a performance comparison to the PyTorch version at the top?
1 comments

Hi thanks for feedback! That’s a good point I did compare to torch but at a high enough sequence length (~1024) torch version starts OOM because it has to materialize the S^2 in global mem. On small sequence length, torch does win solely on optimised cublas matmuls