The data that is out there is reasonably promising, with WebGPU already in use in some production ML inference engines. TVM, of course, is way ahead of the curve as usual - https://tvm.apache.org/2020/05/14/compiling-machine-learning... - though that post is quite old now.
It's still early days for pushing compute use cases to WebGPU (OctoML being super early notwithstanding). There's a small matmul in the examples directory, but it only has the most basic tiling optimizations. One of my goals for the next few weeks is porting the transformer block kernels from llm.c - I think that will flesh out the picture far better. If there's interest, I'm happy to collaborate and could potentially do a writeup.
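To illustrate what "basic tiling" means for a matmul, here is a rough CPU-side sketch in C++. It is not the code from the gpu.cpp examples directory and it is not a WGSL kernel; the names `kTile`, `matmulNaive`, and `matmulTiled` and the tile size of 16 are assumptions made up for this example. The idea is only that the tiled version does the same arithmetic in small blocks so each block of the inputs gets reused while it is still in fast memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; a real kernel would tune this to the hardware.
constexpr std::size_t kTile = 16;

// Reference version: C (m x n) = A (m x k) * B (k x n), all row-major.
std::vector<float> matmulNaive(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

// Tiled version: same result, but the loops walk kTile-sized blocks so a
// block of A and a block of B are reused before moving on.
std::vector<float> matmulTiled(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i0 = 0; i0 < m; i0 += kTile) {
    for (std::size_t p0 = 0; p0 < k; p0 += kTile) {
      for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
        // Clamp block bounds so sizes need not be multiples of kTile.
        const std::size_t iEnd = std::min(i0 + kTile, m);
        const std::size_t pEnd = std::min(p0 + kTile, k);
        const std::size_t jEnd = std::min(j0 + kTile, n);
        for (std::size_t i = i0; i < iEnd; ++i) {
          for (std::size_t p = p0; p < pEnd; ++p) {
            const float aip = a[i * k + p];
            for (std::size_t j = j0; j < jEnd; ++j) {
              c[i * n + j] += aip * b[p * n + j];
            }
          }
        }
      }
    }
  }
  return c;
}
```

On a GPU the analogous step is staging each tile in workgroup shared memory so the threads in a workgroup reuse it instead of rereading global memory. Optimizations beyond this basic form (register blocking, vectorized loads, and so on) are where most of the remaining performance usually comes from.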
There are always some tradeoffs that come with portability, but part of my goal with gpu.cpp is to create a scaffold to experiment and see how far we can push portable GPU performance.
Since this library ends up acting as a layer on top of the listed specifications, it would be more useful to see benchmarks comparing its performance against building on those specifications directly, to get an idea of the overhead. At that point you could layer existing generic comparisons of the specifications you listed (or anything else, for that matter) on top, instead of needing them all to be redone specifically with this library in mind.