by 0x02A
817 days ago
Thanks for trying it out! I haven't had the opportunity to benchmark this on an Intel MacBook. Were you able to see which kernel takes the most time? There should be a performance graph if you have the GUI enabled.

For my Apple Silicon benchmarks, the main bottleneck is the parallel radix sort that sorts the Gaussians by tile and depth. I used some shaders from a sorting library, but they fall short of SOTA parallel sort algorithms. I think fixing this would give a 1.5x overall performance boost and maybe 3x on MacBooks. Also, the wave size isn't tuned for different GPUs.

Another area of improvement is better management of the shared memory. Right now, we just let the driver manage it as the L1 cache. However, we could manage it manually and group Gaussian retrievals together for the same tile. This is what the official implementation does.

Although 3DGS is the first radiance field with SOTA quality that runs in real time, I think it's still quite heavy. Due to the explicit representation of the scene, a lot of operations are memory bound. If you can't get an interactive frame rate right now, it's unlikely the improvements will make a material difference. Hopefully that's where your work on compression comes in and solves the problem :)
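To make the "sort by tile and depth" step concrete, here is a rough CPU sketch in Python. It assumes a key layout with the tile ID in the upper 32 bits and the float32 depth bits in the lower 32 bits, and a plain LSD radix sort with 8-bit digits; the function names are made up for illustration, and this is not the VulkanSplatting shader code.

```python
import struct

def depth_bits(depth: float) -> int:
    # Reinterpret a non-negative float32 as uint32; for such values the
    # integer order matches the float order.
    return struct.unpack("<I", struct.pack("<f", depth))[0]

def make_key(tile_id: int, depth: float) -> int:
    # Assumed layout: tile ID in the high 32 bits, depth in the low 32 bits,
    # so one sort groups by tile and orders front to back within a tile.
    return (tile_id << 32) | depth_bits(depth)

def radix_sort(keys, values, key_bits=64, digit_bits=8):
    # Stable LSD radix sort: one counting pass per digit, lowest digit first.
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        counts = [0] * (mask + 1)
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into output offsets.
        offsets, total = [0] * (mask + 1), 0
        for d in range(mask + 1):
            offsets[d] = total
            total += counts[d]
        out_k, out_v = [0] * len(keys), [None] * len(keys)
        for k, v in zip(keys, values):
            d = (k >> shift) & mask
            out_k[offsets[d]], out_v[offsets[d]] = k, v
            offsets[d] += 1
        keys, values = out_k, out_v
    return keys, values

def sort_splats(instances):
    # instances: list of (tile_id, depth, gaussian_id), one per tile overlap.
    keys = [make_key(t, d) for t, d, _ in instances]
    ids = [g for _, _, g in instances]
    return radix_sort(keys, ids)[1]
```

For example, `sort_splats([(1, 2.5, 10), (0, 9.0, 11), (1, 0.5, 12), (0, 1.0, 13)])` groups the two tile-0 entries first and orders each tile by increasing depth. On the GPU, each digit pass is a histogram, a prefix sum, and a scatter, which is why the sort tends to be bound by memory traffic.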
I still have a VulkanSplatting build from this Wednesday though. In VulkanSplatting, when looking at the Lego scene from above with the default window size, I'm getting just below 1 ms for the sorting kernel and just above 1 ms for "render". Everything else is too small in the graph to register. But it only displays at a handful of fps, so it seems quite some time goes unaccounted for.
Maybe spinning up Instruments could give some more insight into what's happening? I tried `cmake -G Xcode` to get that set up easily, but the Xcode CMake generation fails with