Highway looks awesome from a quick glance. Definitely going to set some time aside to play with it and read it.
Sort of tangentially related, what sort of tools are you using for disassembly in 2022? Is there anything better than objdump today? What I really want is a version of godbolt that reads compile_commands.json and can show me the godbolt-style color-coded disassembly for an entire binary, any function. I find that when I’m asking these sort of questions I’m wasting a lot of time fitting my problem into godbolt so I can just get a rough idea of what’s coming out, because it’s so much faster (maybe 10x) to read the color tagged version from godbolt than it is to do the same with objdump.
Edit: along the same lines, what can people do (or rather, what do you do) about situations like this: https://github.com/google/highway/blob/master/hwy/examples/b... where things might get better in the future but for now we have to implement it another way? How do you avoid regressions and how do you track workarounds? I want some sort of database that tries competing implementations on every build and emits a compiler warning when things get better. How are you dealing with this?
Thanks! Feel free to raise Github issues if you'd like to ask/discuss anything.
I'm also a huge fan of Godbolt/Compiler Explorer. Highway is integrated there, so you can just copy in your functions. Here's the last throwaway test that I was using: https://gcc.godbolt.org/z/5azbK95j9
> things might get better in the future but for now we have to implement it another way
There's several possible answers.
1) For the issue you pointed to, that we cannot have arrays of N vectors, it's a reasonable workaround to instead allocate an array of N vectors' worth of elements, and trust the compiler to elide unnecessary Load/Store. This often works using clang, and would avoid having to manually unroll here. I do prefer to minimize the amount of compiler magic required for good performance, though, so typically we're unrolling manually as shown in that code.
2) If there are compiler bugs, typically the workarounds have to stay in because in the open-source world people are still using that compiler years later.
Automatically detecting when things get better is an interesting idea but I am not yet aware of such infrastructure.
// Compiler doesn't make independent sum* accumulators, so unroll manually.
// We cannot use an array because V might be a sizeless type. For reasonable
// code, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
That code needs 2 loads per FMA. So a CPU with 2 FMA ports would need at least 4 load ports to be able to feed the 2 FMA ports. Given that most CPUs with 2 FMA ports have just 2 load ports, unrolling by 4 should be more or less ideal.
But, ideally, the compiler could make the decision based on the target architecture.
Without enabling associative math, it isn't legal to duplicate floating point accumulators and change the order of the accumulation. Perhaps compiling under `-funsafe-math` would help.
If you're using GCC, you'll probably need `-fvariable-expansion-in-unroller`, too.
I think highway looks great. I'm sure I'll procrastinate on something important to play with it reasonably soon.
Thanks :) I'd be interested to hear how it goes for you.
Agree that 4x unrolling is getting most of the low-hanging fruit without excessive code size. I saw only very slightly better performance on SKX with 8x.
You're right that it's nicer when the compiler can decide about the unrolling - for example with knowledge whether we have 16 or 32 regs. The unsafe/fast-math flags are pretty dangerous, though :/ https://simonbyrne.github.io/notes/fastmath/
Especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.
Interestingly from the portability angle, this was all written in C# but with the SIMD intrinsics there he was still able to obtain ~SOTA performance.
The repo I linked is his C++ version of the C# original.
It was submitted to the C# standard library but was rejected (as an observer, the rejection was disappointing) – had it been accepted C# would have been in the interesting position of having a faster sort than any native language standard library, at least for integer keys.
I've read parts 1-5, there doesn't seem to be a part6. Thanks again for the link, it is regrettable that there is duplication of effort. His approach is quite similar to ours (including HeapSort, sorting network, unrolled in-place partition), and re-invents what he calls "double-pumped partitioning" (introduced by Bramas).
Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.
It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:
* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)
* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).
* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.
* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).
* The pivot is only a median of three. We use a much larger sample.
Thanks for sharing results and for the write-up. It is another welcome addition to a series of recent reports attesting the SIMD instructions having entered the mainstream computing, from the high performance JSON parsing to now the quick sort.
The white paper is calling out the memory bandwidth as a major bottleneck, which is where I have a question.
The M1 Max sports a unusually wide (512 bit wide) memory bus coupled with 6.4Ghz DDR5 unified memory which, according to the available empirical evidence, allows a single CPU core at achieve 100Gb/sec circa memory transfer speed. M1 cores also feature a very large L1 D-cache as well as a large L2 cache (48 Mb has been reported). Results, however, are approximately 2.4x lower for the M1 Max. I do realise that the NEON SIMD processing will always be slower compared to AVX-512 SIMD based one due to a 4x difference of the vector size, but isn't the M1 Max supposed to perform somewhat faster than in observed figures due to the faster and much wider memory bus that would partially compensate for NEON inefficiency?
Other than the vector size difference between NEON and AVX-512, would you attribute the difference in performance to the small test batch size (~4/8/16 Mb for 32/64/128 unit sizes used in the test) thus being able to able to fit in the L2 cache nearly entirely, or due to GCC being unaware of cache line sizes on M1 Pro/Max therefore resulting in the cache underutilisation or inefficient instruction scheduling, or you would purely attribute it to NEON having not aged gracefully to meet current data processing demands?
:)
If I'm reading 100Gb/sec correctly as Gigabits, a Skylake core can also reach such bandwidth usage. The issue is that the system can only sustain that for roughly 8 cores. It would be very interesting to see bench_parallel results for an M1 Max, if anyone would like to give that a try? I suspect it will be comparable to Skylake-X, because 8 (performance) cores are not quite enough to utilize all the M1 Max bandwidth in this app. M1 may be more power efficient, though. It is also great to see more bandwidth per core. A hypothetical M2 with say 256-bit SVE vectors and 16 cores could be very interesting.
L2 caches are typically partitioned and private to a core. If that also applies to the M1 Max, then each core would only access 3 MB, thus the working set is larger than cache as intended.
I do believe NEON is the limiting factor here. I haven't looked into how many IPC we should expect, but even if it is 4 (the number of 128-bit NEON units), Skylake is often reaching 2 (with 512-bit vectors), so the measurement that an M1 core is about half as fast as SKX for this use case seems plausible.
Very cool work! What do you see as the typical process of getting code like this into use in well known codebases? I'm trying to get a sense of when a typical engineer (operating with higher level language or libraries) might get to take advantage of this.
Thanks! Highway is included in Chrome and Firefox. Not sure what you mean by process? It's pretty easy for example using git submodules - you tell Git the Highway version you want to depend on, notify your CMake or Bazel build system of the hwy library dependency, and that's it.
One requirement is C++11 or later - some people have asked about supporting C but I believe that's infeasible.
Regarding columnar databases, is there an optimal data structure in between storing data as rows and storing data as columns that would make it fast to sort and filter by either dimension? Some mix of B-trees or interval trees?
SIMDe implements the exact semantics of a given ISA using either the native intrinsics, or other platform's intrinsics when possible, otherwise autovectorization. This is great when you've already written code using e.g. NEON intrinsics and want it to run on x86 (similar to neon2sse), or the reverse. It's an impressive undertaking to implement thousands of instructions for each platform :)
Highway instead aims for a path that each platform can implement efficiently. For example, ReorderWidenMulAccumulate bridges differences between NEON and x86 which user code need not care about if they just want a dot product, and CountTrue is efficient on both platforms without requiring NEON to go to the extra trouble of matching the exact x86 movmskb semantics.
Also, Highway uses only intrinsics and does not rely on autovectorization (+), which makes for more predictable performance.
+ except in the EMU128 target, which is only used if SIMD/intrinsics are disabled
I started reading through Highway docs, and maybe I missed something, but I’m surprised it’s a built library to be linked against, instead of a header-only library —- at least when using static targeting. If using static targeting, wouldn’t function call overhead be severe, or are a lot of implementations in the headers?
Good question. There are only a few functions implemented in .cc files, tagged with HWY_DLLEXPORT, notably memory allocation and detecting x86 CPU capabilities. If it were necessary, we could likely strip those out or do a header-only library. The ops/intrinsics called from user code are inlined in headers.
In short, it turns out not to help for single core with vectors, but a few initial passes of ips4o (with 64..256-way partitioning) is faster for parallel sorts.
Depends on your goals?
If you'd like to contribute code under Apache2, I'd be happy to collaborate.
Feel free to reach out to the email in the Highway README.
Some of the things on my short to medium-term todo list include:
- CompressStore for 8-bit lanes (before Icelake, that will likely require splitting up into runs of 8 bytes with one PSHUFB + store each)
- also ExpandLoad (analogous to CompressStore) and LeadingZeroCount
- the C++ STL isn't autovectorized much (http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clan...). Manual vectorization can help, we have some of the simpler algorithms already in hwy/contrib/algo, others such as unique, reverse, minmax_element, all_of are more interesting.
I have no idea what you just wrote. I don't speak C, know basically nothing about it, I'm not in the business and I'm not ever going to do either, because I don't like it.
So I looked that up. You're talking about intrinsics. I have no idea about that, beyond "Why would people rather remember this instead of just learning the proper assembly code, which is far easier to read and remember anyway?"
I don't write code for compilers to do the work I am supposed to be doing, so I have no idea of the world you code in.
Your response doesn't really answer my question, but that's totally my fault, so ...
Hand me a box with a defined problem, some compiled code I can run to benchmark my own solution against, then wait for me to beat your solution within defined parameters.
My goal is to have some challenges that need optimization, but there needs to be a point to it. I can take any problem from the web, rethink the approach and rewrite it in assembly, but in 99.999% of all cases there's no point to it and I have nothing to compare it against on my own machine anyway.
I do not know what's generally useful to optimize and I certainly do not know who actually cares about that anymore.
Like, for example, how the pattern matching/string search function in freepascal is damn slow. I've researched the topic and noticed that all the solutions I've found are basically crap, so I wrote my own, beating it by some wide margin in terms of codesize and performance. I don't even know why these devs accept subpar solutions, but responses I got to that question were, basically, "It's fast enough" ... which is dumb.
Still, I don't actually have a way of fairly comparing my work.
Regardless, apparently the fastest fixed size pattern/string search algorithms I've found are all crap, but I really have no way of creating a fair comparison. Oh, on that note ... you don't, by chance, have a pattern matching benchmark for fixed sized needles I could compare my own against?
Sorry for being a mess. :D I really just wanna work on something that's actually a challenge AND something people actually need to run more efficiently, just like the QuickSort example.
Thanks :) If I understand correctly, you have tuples of (number, T*) and want to sort by number? That can be done by storing these tuples interleaved, reinterpret_casting to hwy::uint128_t (making sure that the pointer bits are in the least-significant half), and sorting those.
Sort of tangentially related, what sort of tools are you using for disassembly in 2022? Is there anything better than objdump today? What I really want is a version of godbolt that reads compile_commands.json and can show me the godbolt-style color-coded disassembly for an entire binary, any function. I find that when I’m asking these sort of questions I’m wasting a lot of time fitting my problem into godbolt so I can just get a rough idea of what’s coming out, because it’s so much faster (maybe 10x) to read the color tagged version from godbolt than it is to do the same with objdump.
Edit: along the same lines, what can people do (or rather, what do you do) about situations like this: https://github.com/google/highway/blob/master/hwy/examples/b... where things might get better in the future but for now we have to implement it another way? How do you avoid regressions and how do you track workarounds? I want some sort of database that tries competing implementations on every build and emits a compiler warning when things get better. How are you dealing with this?