Hacker News new | ask | show | jobs
by Aurornis 40 days ago
> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.

2 comments

Deepseek's custom PTX code has previously outperformed CUDA running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,

https://www.tomshardware.com/tech-industry/artificial-intell...

Custom code targeting one specific hardware implementation can improve performance quite a bit.

When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.
Absttaction doesnt always imply performance overhead.
Abstraction necessarily reduces fit to the hardware when multiple different kinds of hardware are supported. Whether that is towards the hardware you are using varies, but in many cases it is, which means you can reach performance gains by shedding the additional support to focus on just your hardware.