Hacker News new | ask | show | jobs
by pca006132 386 days ago
While it is decidable, people typically never produce optimal programs even for the hot path. It is just intractable and too slow to do right now.

For register allocation and instruction selection, there is hope because it is FPT and there are algorithms to do it optimally in polynomial time, albeit with a large constant factor (FPT), making it impractical to apply to compilers as of today. For instruction scheduling, it is just too hard. If you read literature on scheduling algorithms, it is NP-hard for apparently simple instances, e.g., 2 parallel identical machines with no preemption and bounding completion time (https://www2.informatik.uni-osnabrueck.de/knust/class/), while actual microarchitecture is much more complicated than this...

Needless to say, these are already the simpler problems. The longer the program or the more profiling data you can optimize for, the more tricks you can throw at it, and most of them are NP-hard to optimize optimally.

Being NP-hard doesn't imply that you can't obtain the optimal result, but compilers that I know of do not implement them, because most users are not willing to wait for days for such a compilation to complete. Ideally, one should make something that can run on clusters of CPUs or GPUs to optimize this, and people having those clusters will typically be willing to do this because they want to optimize the program they later run on the clusters. However, to my knowledge, no one is working on this at the moment.

1 comments

While you are right that in general it may be very difficult or quasi-impossible to generate optimal programs, there are also plenty of cases where an optimal program is achievable.

The latter happens when there is one dominant bottleneck for the algorithm, which is determined by the hardware, e.g. the maximum throughput of a certain instruction, like multiplication or memory load. When the implemented program reaches a throughput almost equal with that absolute limit, then one can be certain of its optimality.

Yeah, in that case I guess it is usually memory-bound? When it is memory-bound you don't really care that much about scheduling etc.
Matrix multiplies are typically compute bound, but you don't get much option to improve the actual algorithm because Nvidia gives you an accelerator for one and anything else would be slower.