| HN Mirror

by neonsunset 717 days ago

If you like SIMD and would like to dabble in it, I can strongly recommend trying it out in C# via its platform-agnostic SIMD abstraction. It is very accessible, especially if you already know a little C or C++, and it compiles to very competent codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD and, in .NET 9, SVE1/2:

https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Other examples:

CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

The default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods, as in this port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

There are more advanced scenarios. Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...

Note that practically none of these need to reach out to platform-specific intrinsics (except for replacing movemask emulation with an efficient ARM64 alternative). They use the same path for all platforms, varied by vector width rather than by specific ISA.
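
To make the "same path, varied by vector width" point concrete, here is a minimal sketch of a width-agnostic sum using `Vector<T>`. It is not the linked CoreLib code: the helper name `Sum` is made up for illustration, and unlike the "checked" sum above it does no overflow detection, so it simply wraps on overflow (assuming the default unchecked context).

```csharp
using System;
using System.Numerics;

int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
Console.WriteLine(Sum(numbers));

// Hypothetical helper for illustration only; wraps on overflow instead of throwing.
static int Sum(ReadOnlySpan<int> values)
{
    int sum = 0;
    int i = 0;

    if (Vector.IsHardwareAccelerated && values.Length >= Vector<int>.Count)
    {
        Vector<int> acc = Vector<int>.Zero;
        int lastBlock = values.Length - Vector<int>.Count;

        // Each iteration consumes Vector<int>.Count elements, whatever the hardware width is.
        for (; i <= lastBlock; i += Vector<int>.Count)
        {
            acc += new Vector<int>(values.Slice(i));
        }

        // Horizontal add of all lanes of the accumulator.
        sum = Vector.Sum(acc);
    }

    // Scalar tail for the elements that did not fill a whole vector.
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }

    return sum;
}
```

The loop body never mentions an ISA. `Vector<int>.Count` is, for example, 4 with 128-bit vectors and 8 with 256-bit vectors, so only the stride changes between machines.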

2 comments

runevault 717 days ago

Funny you mention C#. I started to look at this and made the mistake of wanting to do string comparison via SIMD, except you can't do it externally because it relies on private internals (note: the built-in comparison for C# already uses SIMD, you just can't easily reimplement it against the built-in string type).

neonsunset 717 days ago

What kind of private internals do you have in mind? You absolutely can hand-roll your own comparison routine; it is just hard to beat the existing implementation, especially once you start considering culture-sensitive comparison (which may defer to e.g. ICU).

There are no private SIMD APIs, save for the sequence comparison intrinsic for unrolling against known lengths, which JIT/ILC does for spans and strings.

runevault 717 days ago

IIRC (it's been a month or so since I looked into it), I couldn't access the underlying array in a way SIMD liked. If memory serves, the actual string class does it through private properties of the string that are only available internally, to guarantee you don't change the string data.

neonsunset 717 days ago

String can provide you with a `ReadOnlySpan<char>`, from which you can either take a `ref readonly char` "byref" pointer, which all vectors work with, or use the unsafe variant and make this byref mutable (just don't write to it) with `Unsafe.AsRef`.

Because pretty much every type that has linear memory can be represented as a span, every span is amenable to pointer (byref) arithmetic, which you then use to write a SIMD routine, e.g.:

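    using System;
    using System.Runtime.InteropServices; // MemoryMarshal
    using System.Runtime.Intrinsics;      // Vector128

    var text = "Hello, World! Hello, World!";
    // Reinterpret the UTF-16 chars as ushort, an element type Vector128<T> supports.
    var span = MemoryMarshal.Cast<char, ushort>(text);
    ref readonly var ptr = ref span[0];

    // Load the first 8 chars and compare every lane against ','.
    var chunk = Vector128.LoadUnsafe(in ptr);
    var needle = Vector128.Create((ushort)',');
    var comparison = Vector128.Equals(chunk, needle);
    // One bit per lane; the lowest set bit is the index of the first match.
    var offset = uint.TrailingZeroCount(comparison.ExtractMostSignificantBits());

    Console.WriteLine(text[..(int)offset]);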

If you have doubts regarding codegen quality, take a look at: https://godbolt.org/z/b97zjfTP7 The above vector API calls are lowered to lines 17-22.
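
The snippet above only inspects the first 8 chars. As a rough sketch of how the same calls extend to a whole string, here is a loop with a scalar tail. `IndexOfChar` is a hypothetical helper written for this example, not a CoreLib API, and it skips the short-length tuning CoreLib does.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

Console.WriteLine(IndexOfChar("Hello, World! Hello, World!", ','));

// Hypothetical helper: index of the first `value` in `text`, or -1 if absent.
static int IndexOfChar(ReadOnlySpan<char> text, char value)
{
    // Reinterpret UTF-16 chars as ushort so they can be loaded into Vector128<ushort>.
    ReadOnlySpan<ushort> span = MemoryMarshal.Cast<char, ushort>(text);
    ref ushort start = ref MemoryMarshal.GetReference(span);
    nuint length = (nuint)span.Length;
    nuint i = 0;

    if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<ushort>.Count)
    {
        Vector128<ushort> needle = Vector128.Create((ushort)value);
        nuint lastBlock = length - (nuint)Vector128<ushort>.Count;

        for (; i <= lastBlock; i += (nuint)Vector128<ushort>.Count)
        {
            // Byref arithmetic: load 8 elements starting at element offset i.
            Vector128<ushort> chunk = Vector128.LoadUnsafe(ref start, i);
            uint mask = Vector128.Equals(chunk, needle).ExtractMostSignificantBits();
            if (mask != 0)
            {
                // Lowest set bit marks the first matching lane in this chunk.
                return (int)i + BitOperations.TrailingZeroCount(mask);
            }
        }
    }

    // Scalar tail for the remaining elements (fewer than one full vector).
    for (; i < length; i++)
    {
        if (Unsafe.Add(ref start, i) == value)
        {
            return (int)i;
        }
    }

    return -1;
}
```

Checking `mask != 0` before counting trailing zeros matters: with no match in the chunk the mask is zero, and the trailing-zero count of zero is 32, which is not a valid lane index.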

runevault 717 days ago

Oh interesting, I'll have to give that a try then. My concern was avoiding a reallocation by doing it another way, but if the readonly span works I can see how it would get you there. I need to see if I still have that project to test it out, appreciate the heads up. SIMD is something I really want to get better with.

neonsunset 717 days ago

If you go through the guide at the first link, it will pretty much set you up with the basics to work on vectorization, and once done, you can look at what CoreLib does as a reference (just keep in mind it tries to squeeze all the performance for short lengths too, so the tail/head scalar handlers and dispatch can be high-effort, more so than you may care about). The point behind the way .NET does it is to have the same API exposed to external consumers as the one CoreLib uses itself, which is why I was surprised by your initial statement.

No offense taken, just clarifying, SIMD can seem daunting especially if you look at intrinsics in C/C++, and I hope the approach in C# will popularize it. Good luck with your experiments!

zvrba 716 days ago

I implemented a sorting network in C# with AVX2 intrinsics. https://github.com/zvrba/SortingNetworks

neonsunset 716 days ago

It's a nice piece of work! If you're interested, .NET's compiler has improved significantly since 3.1, in particular around structs and the pre-existing intrinsics (which no longer need to be used directly in most situations; pretty much all code prefers plain methods on VectorXXX<T> whenever possible). Also note the use of the AggressiveOptimization attribute, which disables tiered compilation and forces the static initialization checks your readme refers to. Removing AO allows the compiler to bake statics directly into codegen through tiered compilation, since upon reaching Tier 1 the value of such readonly statics will be known. For trivially constructed values, it is better not to store them in fields but rather to construct them in place via e.g. expression-bodied properties like `Vector128<byte> MASK => Vector128.Create((byte)0x80)`. I don't remember exactly whether this was introduced in Core 3.1 or 5, but today the use of the `AggressiveOptimization` flag is discouraged unless you do need to bypass DynamicPGO.
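
A small sketch of the "construct in place" suggestion, under the assumption that the mask is a trivially constructed constant. The names `Masks`, `HighBit` and `ClearHighBit` are made up for this example and do not come from the SortingNetworks repo.

```csharp
using System;
using System.Runtime.Intrinsics;

var input = Vector128.Create((byte)0x81);
Console.WriteLine(Masks.ClearHighBit(input));

static class Masks
{
    // Expression-bodied property: the vector is created at each use site instead of being
    // read from a static field, so no static initialization check is involved.
    private static Vector128<byte> HighBit => Vector128.Create((byte)0x80);

    // No [MethodImpl(MethodImplOptions.AggressiveOptimization)] here, so the method
    // goes through regular tiered compilation.
    public static Vector128<byte> ClearHighBit(Vector128<byte> value) => value & ~HighBit;
}
```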

You also noted the inability to express numeric properties of T within a generic context. This was indeed true, and the limitation was eventually addressed by the generic math feature. There are INumber<T>, IBinaryInteger<T> and others to constrain T on, which bring the comparison operators you were looking for.
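
For illustration, here is a scalar sketch of what generic math allows: a compare-exchange step and a 3-element sorting network written once for any T with comparison operators. The names `CompareExchange` and `Sort3` are hypothetical, and this is plain scalar code rather than the AVX2 approach in the repo.

```csharp
using System;
using System.Numerics;

int[] data = { 3, 1, 2 };
Sort3<int>(data);
Console.WriteLine(string.Join(", ", data));

// Swap so that a <= b afterwards; `>` is available through the generic math constraint.
static void CompareExchange<T>(ref T a, ref T b) where T : IComparisonOperators<T, T, bool>
{
    if (a > b)
    {
        (a, b) = (b, a);
    }
}

// Three compare-exchange steps are enough to sort three elements.
static void Sort3<T>(Span<T> v) where T : IComparisonOperators<T, T, bool>
{
    if (v.Length != 3) throw new ArgumentException("Expected exactly 3 elements.", nameof(v));

    CompareExchange(ref v[0], ref v[1]);
    CompareExchange(ref v[1], ref v[2]);
    CompareExchange(ref v[0], ref v[1]);
}
```

`IComparisonOperators<T, T, bool>` is the narrowest constraint that supplies `>`; `INumber<T>` includes it, so constraining on `INumber<T>` would work as well.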

In general, knowledge of vectorized code has substantially improved within the community, and it is used much more liberally nowadays by those who are aware of it.
