|
|
|
|
|
by lhl
34 days ago
|
|
When it's in a good state I'll open source it, I am keeping track of what optimizations make the most impact, stuff like this: ### Diagnosing parallelism pathologies (L1) *Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization
will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help. *Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an
`if (threadIdx.x == 0)` gate around a serial top-k, reduction, or
scan? For c=1 decode, many kernels can't grow the grid, but they can
always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a
strong secondary signal of the within-block pathology. *Router top-k (within-block fix)*:
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)`
gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__`
top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E. |
|