by lamchob
1150 days ago
If you read the paper (https://dl.acm.org/doi/10.1145/3575693.3575702), you can find more performance comparisons.
From a latency/throughput point of view, they are on par with existing tools like TVM/Ansor: sometimes faster, sometimes slower. What is more interesting is that they have a very GPU-specific auto-tuning routine that drastically reduces the optimization space compared to TVM/Ansor. They go from ~10^6 possible implementations for an operator to a "few hundred", which enables a much faster time-to-solution. This is achieved with a GPU-centric problem formulation and search space. In essence, they trade generality (from "any" kind of hardware down to only GPU-style architectures) for search speed.
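
To give a rough feel for why a hardware-centric formulation shrinks the search space so much, here is a toy sketch. It is not the paper's actual search space or API: the function names, the 4-level loop split, and the lists of tile sizes are all assumptions I made up for illustration. It compares a generic "split every loop every possible way" space against a small hand-picked set of GPU-shaped choices for a 1024x1024x1024 matmul.

```python
from itertools import product


def factorizations(n, parts):
    """Yield every ordered way to write n as a product of `parts` positive integers."""
    if parts == 1:
        yield (n,)
        return
    for d in range(1, n + 1):
        if n % d == 0:
            for rest in factorizations(n // d, parts - 1):
                yield (d,) + rest


def loop_oriented_space_size(m, n, k, levels=4):
    """Size of a hypothetical generic space: each of the three matmul loops
    is split into `levels` nested tiles in every possible way."""
    return (
        sum(1 for _ in factorizations(m, levels))
        * sum(1 for _ in factorizations(n, levels))
        * sum(1 for _ in factorizations(k, levels))
    )


def gpu_centric_space():
    """A hypothetical GPU-shaped space: only a handful of block tiles,
    warp tiles, k-tile sizes and pipeline depths (all values are assumptions)."""
    block_tiles = [(64, 64), (64, 128), (128, 64), (128, 128)]
    warp_tiles = [(16, 16), (32, 32), (32, 64), (64, 32)]
    k_tiles = [8, 16, 32]
    stages = [1, 2, 3]
    return [
        (block, warp, kt, st)
        for block, warp, kt, st in product(block_tiles, warp_tiles, k_tiles, stages)
        # Keep only configurations where the warp tile evenly divides the block tile.
        if block[0] % warp[0] == 0 and block[1] % warp[1] == 0
    ]


if __name__ == "__main__":
    print("generic candidates:    ", loop_oriented_space_size(1024, 1024, 1024))
    print("GPU-centric candidates:", len(gpu_centric_space()))
```

The generic space grows combinatorially with the loop extents, while the GPU-centric one stays fixed at whatever the hand-written lists allow. That is the trade described above: far fewer candidates to measure, at the cost of only making sense on GPU-style hardware.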