by fwsgonzo 1295 days ago
It also annoys me a bit, the things JIT people write in their GitHub READMEs about the incredibly theoretical improvements that can happen at runtime, yet it's never anywhere close to AOT compilation. Then you can add 2-3x on top of that for hand-written assembly.

I do wonder what's going on with projects like BOLT, though. I saw that it was merged into LLVM, and I have tried to use it, but the improvement was never more than 7%. I feel like it has a lot of potential because it does try to take run-time behavior into account.

See: https://github.com/llvm/llvm-project/tree/main/bolt

3 comments

> improvement was never more than 7%.

If your use case isn't straining icache then you won't benefit as much.

BTW 7% is huge, odd that you would describe it as "only".

> BTW 7% is huge, odd that you would describe it as "only".

It depends on what you're doing and how optimized the baseline performance is. In my area (CRDTs) the baseline performance is terrible for a lot of these algorithms. Over about 18 months of work I've managed to improve on automerge's 2021 performance of ~5 minutes / 800MB of RAM for one specific benchmark down to 4ms / 2MB. That's 75000x faster. (Yjs in comparison takes ~1 second.)

Almost all of the performance improvement came from using more appropriate data structures and optimizing the fast path. I couldn't find an off-the-shelf B-tree or skip list which did what I needed here. I ended up hand-coding a B-tree for sequence data which run-length encodes items internally, and knows how to split and merge nodes when inserts happen. CRDTs also have a lot of fiddly computations when concurrent changes edit the same data, but users don't do that much in practice. Coding optimized fast paths for the 99% case got me another 10x performance improvement or so.
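To make the run-length encoding and fast-path idea concrete, here is a minimal sketch in Rust. It is not the commenter's code: a flat `Vec` stands in for the B-tree leaves, and the names `Run`, `RleSeq`, and `to_ids` are hypothetical. Items are assumed to carry integer ids, and consecutive ids inserted next to each other collapse into one run.

```rust
/// A run of `len` consecutive items with ids `start, start + 1, ...`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    start: u32,
    len: u32,
}

/// Run-length encoded sequence. Assumption: a flat Vec replaces the
/// B-tree described above, so the slow path is O(number of runs).
#[derive(Debug, Default)]
struct RleSeq {
    runs: Vec<Run>,
    total: u32, // cached item count, so the fast path does no scanning
}

impl RleSeq {
    fn len(&self) -> u32 {
        self.total
    }

    /// Insert item `id` so that it ends up at index `pos` (0..=len).
    fn insert(&mut self, pos: u32, id: u32) {
        assert!(pos <= self.total, "position out of bounds");

        // Fast path: appending the next consecutive id extends the last run.
        if pos == self.total {
            if let Some(last) = self.runs.last_mut() {
                if last.start + last.len == id {
                    last.len += 1;
                    self.total += 1;
                    return;
                }
            }
        }

        // Slow path: find the run that contains (or ends at) `pos`.
        let mut offset = pos;
        let mut i = 0;
        while i < self.runs.len() && offset > self.runs[i].len {
            offset -= self.runs[i].len;
            i += 1;
        }
        let new_run = Run { start: id, len: 1 };

        if self.runs.is_empty() {
            self.runs.push(new_run);
        } else if offset == 0 {
            // Insert before the very first run.
            self.runs.insert(0, new_run);
        } else if offset == self.runs[i].len {
            // At the end of run `i`: extend it if the id continues the run.
            if self.runs[i].start + self.runs[i].len == id {
                self.runs[i].len += 1;
            } else {
                self.runs.insert(i + 1, new_run);
            }
        } else {
            // Inside run `i`: split it and put the new item in between.
            let old = self.runs[i];
            let right = Run { start: old.start + offset, len: old.len - offset };
            self.runs[i].len = offset;
            self.runs.insert(i + 1, new_run);
            self.runs.insert(i + 2, right);
        }
        self.total += 1;
    }

    /// Expand the runs back into individual ids (for inspection only).
    fn to_ids(&self) -> Vec<u32> {
        self.runs.iter().flat_map(|r| r.start..r.start + r.len).collect()
    }
}

fn main() {
    let mut seq = RleSeq::default();
    for id in 0..5 {
        seq.insert(id, id); // sequential typing: stays a single run
    }
    seq.insert(2, 100); // an edit in the middle splits the run
    println!("{} items in {} runs: {:?}", seq.len(), seq.runs.len(), seq.to_ids());
}
```

Sequential typing only ever touches the last run, which is the kind of 99% case the comment describes. A real implementation would also need deletes, merging of neighbouring runs after edits, and a tree on top so that lookups are logarithmic.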

I'd take another 7% performance improvement on top of where I am, but this code is probably fast enough. I hear you that 7% is huge sometimes, and a smarter compiler is a better compiler. But 7% is a drop in the bucket for my work.

7% is huge in the context of compilers, which optimize general-purpose code.
7% is in the ballpark of the speedup most programs get from changing the allocator so that it doesn't give almost every allocation the same huge alignment, and around half the speedup most programs get from using explicit huge pages. These changes are both a lot easier, but e.g. Microsoft doesn't think it's worthwhile to allow developers to make the latter change at all, over 26 years after the feature shipped in consumer Intel CPUs.
That's unfortunate. I wrote a VMM that tries to back memory with hugepages (even the guest's page tables). It's making a difference!
> about the incredibly theoretical improvements that can happen at runtime

Which in the majority of cases can be achieved by profile guided optimization anyways.

It should be part of these discussions to prove what you claim. Always. With code samples, fed directly to the compiler, and the corresponding assembly.

https://godbolt.org/

Statistics alone are worthless. In the end, all that counts is the arena of performance: what the code becomes and how it runs against the handcrafted version.

Godbolt doesn’t accurately show the runtime speed of algorithms on input data, which is what you need when discussing SIMD performance. And often these are proprietary industry algorithms that are the core of a business’s model.

I’m all for transparency but I’m also not about to get fired for posting our kernel convolution routines, or least squares fit model.

> It should be part of these discussions to prove what you claim

Further - these aren’t subjective claims that need to be proven on a forum for legitimacy. It’s the literal state of vector based optimisations in the compiler world right now. It is a hard problem and for the time being humans are much better at it. This is quite a large area of academic research at the moment.
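One commonly cited, non-proprietary illustration (not taken from this thread) is a floating-point reduction. Because float addition is not associative, a compiler that preserves strict semantics typically will not reorder the additions in the single-accumulator loop below, so it stays a serial dependency chain. A human who knows the reordering is acceptable can split the sum across independent accumulators, which is the shape a vectorizer can map onto SIMD lanes. The function names and the lane count of 8 are assumptions for the sketch, and no speedup figure is claimed here.

```rust
/// Straightforward reduction: one accumulator, strict left-to-right order.
fn sum_scalar(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Hand-restructured reduction: eight independent accumulators.
/// This changes the order of additions, so results can differ in the
/// last bits from `sum_scalar` on general input.
fn sum_lanes(xs: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width, e.g. 8 x f32 in a 256-bit register
    let mut acc = [0.0f32; LANES];

    let chunks = xs.chunks_exact(LANES);
    let tail = chunks.remainder();
    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }

    // Combine the lanes, then add whatever did not fill a whole chunk.
    let mut total: f32 = acc.iter().sum();
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    // Small integers are exactly representable, so both orders agree here.
    let xs: Vec<f32> = (1..=100).map(|i| i as f32).collect();
    println!("scalar = {}, lanes = {}", sum_scalar(&xs), sum_lanes(&xs));
}
```

Whether the second version is actually faster depends on the target, the compiler flags, and the data, which is why measuring on real input matters more than reading the generated assembly alone.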

If someone is so uninformed of this domain that they don’t know this, the burden is on that person to learn what the industry is talking about. Not the people discussing the objective state of the industry.

Godbolt takes practice to read. Often, people who are incapable of understanding when you can beat a compiler also cannot be shown a Godbolt snippet in good faith.