Hacker News new | ask | show | jobs
by ademeure 98 days ago
This is very cool!

I've been working on something somewhat similar over the last few weeks, but trying to be much more general and arguably over-engineered! I like the scope of this project, keeping it limited to Triton and specific kinds of kernels makes it quite simple and efficient.

I'm confused by the progress graph though; it looks like it's benchmarking a 4096x4096x4096 fp16 matmul rather than a full repo, and it claims a 1.31x improvement vs cuBLAS... while running at 187 TFLOPS which is 18.9% of peak utilization? cuBLAS definitely gets much closer to peak than that - most likely it's limited by CPU overhead or something else? Benchmarking is hard!

Either way I'm excited to see other people working on this, I think it's an extremely promising area over the next 6 months.