Hacker News new | ask | show | jobs
by lhl 40 days ago
I think especially with the ability for SOTA AI to optimize kernels more people should try their hand at making better inference for their specific hardware.

I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.

Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint and I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models, Kimi K2.6 did surprisingly well), and I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.

I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.

3 comments

In the same boat with 7900xtx. 24GB vram, on paper decent performance, in reality most things don't run. Only llama.cpp is consistent that it can run most models, even if maybe not at top performance (afaik - lacking MTP, problems cache invalidation with hybrid models). At least with llama.cpp I know what runs. With various python-based inferencers, between their uv/venv, my venv, system envs/pythons/libs yadayada - I need an agent to get to the bottom of what's actually running. :-) Yeah IK skill issue/user errors - but don't have seconds in the day left to spend them on that.

Even if not perfect, if you publish on GH or HF, some other agent can maybe start there and not from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE) that's the biggest llm I can ran for practical use on the other h/w I got for local llms (M2 Max). Even if MTP is not working well, still improvement on the current llama.cpp that does not run Ling-2.6-flash at all. This - https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....

It's still Python, but I removed torch dependencies (HIP/C++ for hot paths): http://github.com/shisa-ai/hipEngine/

There's a docs/ folder in there that is probably of interest as well.

Doing the same for Apple M-series with fused wgsl shaders specifically targeting Qwen3/3.5.

My effort is called shady-thinker and is on github at github.com/tmzt/shady-thinker.

This was inspired in part by Antirez's earlier work with C kernels as well as other efforts to support in-browser LLMs. I've adapted them to Rust and the wgpu library.

Gemma 4 is also the next likely target (with the MTP work) as I'm experimenting with local AI agents.

I'd love to see what you've done to improve prefill and decode even if its not directly applicable.

One difference, I'm using MLX and GPTQ 4bit quants including AutoRound with safetensors as my shader pipeline is pretty much fixed for each model, ggml just adds unnecessary complexity.

Please share your knowledge and your findings

I think llama.cpp could have done a much better job supporting PC. Sure, some of it us due to bad vendor support but with so many users I am surprised we don't see more optimized inference on standard PCs

When it's in a good state I'll open source it, I am keeping track of what optimizations make the most impact, stuff like this:

### Diagnosing parallelism pathologies (L1)

*Grid occupancy:* - `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)? - < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help. - 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure. - 1.0-4.0 = healthy; micro-optimization can help.

*Within-block distribution:* - Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block. - `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.

*Router top-k (within-block fix)*: - Kernel: `qwen35_router_select_kernel` @ c=1 decode - Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays. - Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill. - Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.

Took a little longer to clean up than I expected. I'd recommend checking out the ROOFLINE and the LESSONS-LEARNED docs here: https://github.com/shisa-ai/hipEngine/tree/main/docs