

by kgeist 43 days ago

Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and getting harder to obtain every day. If you remove enough abstractions and code directly to the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
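The "optimize in a loop, empirically testing speed/quality" idea could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the project described above: `optimize`, `Result`, `propose`, `evaluate`, and `max_quality_drop` are all hypothetical names, and the agent is reduced to a callback that suggests candidate configurations.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Result:
    tokens_per_sec: float  # measured throughput, higher is better
    quality: float         # score on a fixed eval set, higher is better


def optimize(
    baseline: dict,
    propose: Callable[[dict], Iterable[dict]],  # hypothetical: agent suggests variants
    evaluate: Callable[[dict], Result],         # hypothetical: runs the benchmark
    max_quality_drop: float = 0.01,
    rounds: int = 10,
) -> Tuple[dict, Result]:
    """Greedy search: keep a candidate only if it is faster and still good enough."""
    best_cfg = baseline
    best = evaluate(baseline)
    # The quality floor is fixed relative to the baseline so that small
    # regressions cannot accumulate round after round.
    quality_floor = best.quality - max_quality_drop
    for _ in range(rounds):
        improved = False
        for candidate in propose(best_cfg):
            result = evaluate(candidate)
            faster = result.tokens_per_sec > best.tokens_per_sec
            good_enough = result.quality >= quality_floor
            if faster and good_enough:
                best_cfg, best, improved = candidate, result, True
        if not improved:
            break  # no candidate beat the current best; stop early
    return best_cfg, best
```

Anchoring the quality floor to the baseline rather than to the current best is a deliberate choice in this sketch: it stops the loop from trading away a little quality in every round. When the model changes, only `evaluate` and the baseline need to be swapped.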

8 comments

Aurornis 42 days ago

> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.

GeekyBear 42 days ago

DeepSeek's custom PTX code has previously outperformed CUDA running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,

https://www.tomshardware.com/tech-industry/artificial-intell...

Custom code targeting one specific hardware implementation can improve performance quite a bit.

LoganDark 42 days ago

When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.

Muromec 42 days ago

Abstraction doesn't always imply performance overhead.

LoganDark 42 days ago

Abstraction necessarily reduces fit to the hardware when multiple different kinds of hardware are supported. Whether the loss falls on the hardware you are actually using varies, but in many cases it does, which means you can gain performance by shedding the additional support to focus on just your hardware.

xtracto 43 days ago

This reminds me of the famous high-performance FizzBuzz code golf answer [1]. If we could implement optimizations like that for inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

Juvination 43 days ago

I love scrolling and reading through this, thinking yeah of course Python is slower than Java, oh wow Rust is pretty on par I wonder what the Java devs did. Then you hit asm and your jaw drops.

slaw 43 days ago

Check out cpp at 208.3 GiB/s, 3x faster than asm.

akie 42 days ago

Yeah, because (and here's the trick) they are clever and do less work.

Optimizing things usually means "think of a way to do the same thing with less effort".

andai 42 days ago

Hire the laziest programmer :)

mirsadm 43 days ago

I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.

davidwritesbugs 42 days ago

I tried getting any SOTA LLM (GPT 5, Opus 4.6, Deepseek V4 pro, glm-5) to write a Metal 4 shader for a bottle usdz and none of them got it right. They screwed up the normals and textures, total mess. I tried getting them to do it in Metal 3 and it was still crappy.

wahnfrieden 42 days ago

Just curious if you've tried GPT 5.5 Pro?

egesko 42 days ago

I'll add to this: What if chips were designed for the model? What would happen if we moved from digital to analog (vectors are not represented as bits, but instead as voltages)? Could the compute heavy matrix multiplications be done via op-amps? And could this analog approach be way more efficient than the limitations of bit representation?

kristianp 42 days ago

There is https://taalas.com/ . Their chips are all digital though. The weights are written to silicon.

joshmarlow 43 days ago

Another suggestion for optimizing local inference - the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't - so if your parser can handle the quirks of the specific model, then you get higher-performing functionality.
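As a rough illustration of a model-specific parser quirk (not the Hermes team's actual code, which isn't shown here), the sketch below tolerates trailing commas in JSON output. The names `strip_trailing_commas`, `parse_model_json`, and the `tolerate_trailing_commas` flag are hypothetical; only Python's standard `json` module is assumed.

```python
import json


def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede a closing } or ], outside of strings."""
    out = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False       # this character was escaped; ignore it
            elif ch == "\\":
                escaped = True        # next character is escaped
            elif ch == '"':
                in_string = False     # closing quote
            continue
        if ch == '"':
            in_string = True
        elif ch == ",":
            # Look past whitespace; skip the comma if a closer follows.
            rest = text[i + 1:].lstrip()
            if rest[:1] in ("}", "]"):
                continue
        out.append(ch)
    return "".join(out)


def parse_model_json(text: str, tolerate_trailing_commas: bool = True):
    """Try strict JSON first; fall back to the quirk fix only if enabled."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if not tolerate_trailing_commas:
            raise
        return json.loads(strip_trailing_commas(text))
```

Trying strict parsing first means well-behaved models pay no cost, and the per-model flag keeps the leniency from hiding genuine errors in models that never emit trailing commas.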

didip 42 days ago

What if PyTorch is extended to have a pluggable compiler? For M GPU types and N models, if the backend allows, run a specialized compiler?

nopurpose 42 days ago

Ultra-optimized HW-specific engines are what Mojo lang seems to be targeting, but I rarely hear about it here.

andsoitis 42 days ago

> Mojo lang seems to be targeting, but I rarely hear about it here

Momentum over at Mojo lang seems very very slow.

According to their roadmap, they're still busy on Phase 1 ("High performance CPU + GPU coding"), and haven't touched Phase 2 ("Systems application programming") and Phase 3 ("Dynamic object-oriented programming").

So perhaps there isn't much to talk about?

GeekyBear 42 days ago

They've got a lot of work yet to do to be a general purpose language, but for GPU programming they have already demonstrated that they can outperform CUDA on Nvidia GPUs.

That's pretty compelling.

p_stuart82 42 days ago

this feels closer to ATLAS/FFTW than a model runner. the generated kernel ages out, the tuning harness is the bit you actually want to keep.
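A minimal sketch of that kind of harness, loosely in the spirit of FFTW's planning step: time several equivalent implementations once and remember the winner per (hardware, problem) key. The `Autotuner` class, its `plan` method, and the key layout are assumptions made for illustration, not an existing library's API.

```python
import time
from typing import Callable, Dict, Hashable


class Autotuner:
    """Pick the fastest of several equivalent variants by measuring them."""

    def __init__(self, timer: Callable[[], float] = time.perf_counter, repeats: int = 3):
        self._timer = timer
        self._repeats = repeats
        # Hypothetical key layout, e.g. (gpu_name, hidden_size) -> variant name.
        self._plans: Dict[Hashable, str] = {}

    def _time_once(self, fn: Callable[[], object]) -> float:
        start = self._timer()
        fn()
        return self._timer() - start

    def plan(self, key: Hashable, variants: Dict[str, Callable[[], object]]) -> str:
        """Return the name of the fastest variant, measuring only on first use."""
        if key in self._plans:
            return self._plans[key]  # reuse the earlier measurement
        if not variants:
            raise ValueError("no variants to tune")
        best_name, best_time = None, float("inf")
        for name, fn in variants.items():
            # Take the minimum over a few runs to reduce timing noise.
            elapsed = min(self._time_once(fn) for _ in range(self._repeats))
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        self._plans[key] = best_name
        return best_name
```

The variants (the generated kernels) are disposable inputs; the harness and its cache of plans are the part that survives a model or GPU change, since a new key simply triggers a fresh measurement.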