Hey there, we fused all 24 layers of Qwen3.5-0.8B (a hybrid DeltaNet + Attention model) into a single CUDA kernel launch and open-sourced it for everyone to try. On an RTX 3090 power-limited to 220W:
- 411 tok/s vs 229 tok/s on M5 Max (1.8x)
- 1.87 tok/J, beating M5 Max efficiency
- 1.55x faster decode than llama.cpp on the same GPU
- 3.4x faster prefill

The RTX 3090 launched in 2020. Everyone calls it power-hungry. It isn't, the software is.

The conventional wisdom

NVIDIA is fast but thirsty. Apple Silicon is slow but sips power. Pick a side. With stock frameworks, the numbers back that up:

| Setup | tok/s | Power | tok/J |
|---|---|---|---|
| RTX 3090 (llama.cpp) | 267 | 350W | 0.76 |
| M5 Max (LM Studio) | 229 | ~130W | 1.76 |

Case closed. Except the 3090 has 936 GB/s of bandwidth and 142 TFLOPS of FP16 compute, and llama.cpp extracts 267 tok/s out of it. That ratio is absurd.

Traditional inference dispatches one kernel per operation. For 24 layers, that's roughly 100 launches per token. Every boundary means:
- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads

Why had nobody done this yet?

Qwen3.5-0.8B isn't a vanilla transformer. It alternates:
- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 Full Attention layers: standard MHA

This hybrid pattern is where frontier models are heading: Qwen3-Next, Kimi Linear, all of them. DeltaNet scales linearly with context length instead of quadratically. It's new, and nobody has shipped a fused kernel for it. MLX doesn't have DeltaNet kernels at all. llama.cpp supports it generically. Everyone else is waiting. The 267 tok/s wasn't a hardware ceiling, it was the software ceiling for a brand-new architecture.

We wrote a single CUDA kernel that runs the entire forward pass in one dispatch. Data stays in registers and shared memory as it flows through the network. Zero CPU round-trips, zero redundant memory fetches.

- 82 blocks x 512 threads, all SMs occupied
- BF16 weights and activations, FP32 accumulation
- DeltaNet recurrence runs in warp-cooperative FP32 registers
- Full attention fuses QKV, RoPE, causal softmax, and output projection
- Cooperative grid sync replaces kernel launches between layers

Results on the same RTX 3090, same model, same weights:

| Setup | Prefill (pp520) | Decode (tg128) |
|---|---|---|
| Megakernel | 37,800 tok/s | 413 tok/s |
| llama.cpp BF16 | 11,247 tok/s | 267 tok/s |
| PyTorch + HF | 7,578 tok/s | 108 tok/s |

Then we turned the power down

Fewer wasted cycles means less heat, so we swept nvidia-smi -pl:

| Power limit | Clock | Draw | tok/s | tok/J | Notes |
|---|---|---|---|---|---|
| 420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline |
| 300W | 1935 MHz | 299W | 432 | 1.44 | -5% power, 99.8% speed |
| 220W | 1635 MHz | 220W | 411 | 1.87 | -30% power, 95% speed |
| 150W | 405 MHz | 150W | 194 | 1.29 | clock cliff, too aggressive |

At 220W we hit the sweet spot: 95% of the throughput for 70% of the power. Tighter execution converts almost directly into saved watts.
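
The tok/J column is simply throughput divided by measured draw, since 1 W = 1 J/s. Here is a minimal Python sketch that recomputes it from the sweep numbers above. It is not code from the repo, and the names `SWEEP`, `summarize` and `best_limit` are made up for illustration:

```python
# Rows copied from the power-limit sweep table: (limit_w, draw_w, tok_per_s).
SWEEP = [
    (420, 314, 433),
    (300, 299, 432),
    (220, 220, 411),
    (150, 150, 194),
]


def tokens_per_joule(tok_per_s, draw_w):
    # 1 W = 1 J/s, so (tok/s) / (J/s) = tok/J.
    return tok_per_s / draw_w


def summarize(sweep):
    # The first row is treated as the stock baseline for the relative columns.
    _, base_draw, base_tps = sweep[0]
    rows = []
    for limit_w, draw_w, tps in sweep:
        rows.append({
            "limit_w": limit_w,
            "tok_per_j": round(tokens_per_joule(tps, draw_w), 2),
            "power_vs_stock": round(draw_w / base_draw, 2),
            "speed_vs_stock": round(tps / base_tps, 3),
        })
    return rows


def best_limit(sweep):
    # Power limit whose row has the highest tok/J.
    return max(sweep, key=lambda r: tokens_per_joule(r[2], r[1]))[0]


if __name__ == "__main__":
    for row in summarize(SWEEP):
        print(row)
    print("most efficient limit:", best_limit(SWEEP), "W")
```

Running it reproduces the 1.38 / 1.44 / 1.87 / 1.29 column and picks 220W, with draw at about 0.70x and speed at about 0.95x of stock.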
Measurement: NVML energy counters for NVIDIA, powermetrics for Apple Silicon, matching Hazy Research's Intelligence Per Watt methodology. Accelerator power only, not wall draw.

Without the megakernel the 3090 barely edges out a laptop chip. With it, a five-year-old GPU beats Apple's latest on throughput, narrowly beats it on efficiency (1.87 vs 1.76 tok/J), and costs a quarter as much.
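
For anyone who wants to sanity-check the GPU-side numbers, here is a rough sketch of reading the NVML energy counter around a decode run. It is an assumption about how such a measurement can be set up, not the repo's benchmark code. It assumes the `pynvml` bindings (the `nvidia-ml-py` package) are installed, and `run_decode` is a hypothetical callable standing in for whatever generates tokens:

```python
import time


def nvml_energy_reader(gpu_index=0):
    """Return a zero-argument callable reading cumulative GPU energy in millijoules."""
    # Imported lazily so the rest of the sketch runs on machines without a GPU.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return lambda: pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)


def measure(run_decode, read_energy_mj, clock=time.perf_counter):
    """Run a workload once and report throughput, average power, and tok/J.

    run_decode: hypothetical callable that generates tokens and returns how many.
    read_energy_mj: callable returning cumulative accelerator energy in mJ.
    """
    e0, t0 = read_energy_mj(), clock()
    n_tokens = run_decode()
    e1, t1 = read_energy_mj(), clock()

    joules = (e1 - e0) / 1000.0  # counter is in millijoules
    seconds = t1 - t0
    return {
        "tok_per_s": n_tokens / seconds,
        "avg_watts": joules / seconds,
        "tok_per_j": n_tokens / joules,
    }
```

Usage would look like `measure(my_decode_fn, nvml_energy_reader(0))`. The workload has to finish all GPU work before returning (for PyTorch, call `torch.cuda.synchronize()` at the end), otherwise the second counter read lands too early. Longer runs also give steadier numbers than a single short burst.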

The NVIDIA vs Apple efficiency gap isn't silicon. It's software.

Try it

    git clone https://github.com/Luce-Org/luce-megakernel.git
    cd luce-megakernel
    pip install -e .
    python bench_pp_tg.py

Requires: NVIDIA Ampere+ (tested on 3090), CUDA 12+, PyTorch 2.0+, ~1.5GB VRAM.

Code is open source (MIT): https://github.com/Luce-Org/luce-megakernel

Let us know if you have any feedback.