

by kierank 857 days ago

v210_planar_pack_8_c: 2298.5

v210_planar_pack_8_ssse3: 402.5

v210_planar_pack_8_avx: 413.0

v210_planar_pack_8_avx2: 206.0

v210_planar_pack_8_avx512: 193.0

v210_planar_pack_8_avx512icl: 100.0

23x speedup. The compiler isn't going to come up with some of the trickery to make this function 23x faster.

800% is nothing.


Const-me 857 days ago

You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.

Based on the performance numbers, whoever wrote that test neglected to implement manual vectorization for the C version, which is the only reason the assembly is 23x faster in that test. If they rework their C version with a focus on performance, i.e. using SIMD intrinsics, I'm pretty sure the performance difference between the C and assembly versions is going to be very unimpressive, like a couple of percent.
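To make "manual vectorization with intrinsics" concrete, here is a minimal sketch. It is not the `v210_planar_pack_8` function from the benchmark; the saturating byte add and the names `add_sat_u8_scalar` and `add_sat_u8_simd` are hypothetical, picked only because the operation is small enough to check by eye. It assumes an x86 target with SSE2 and falls back to the scalar loop elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Scalar baseline: dst[i] = min(a[i] + b[i], 255).
void add_sat_u8_scalar(const std::uint8_t* a, const std::uint8_t* b,
                       std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        dst[i] = sum > 255u ? std::uint8_t{255} : static_cast<std::uint8_t>(sum);
    }
}

// Hypothetical intrinsics version: 16 bytes per iteration when SSE2 is
// available, then the scalar loop for whatever is left over.
void add_sat_u8_simd(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads, so the caller does not need aligned buffers.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Unsigned saturating add on 16 packed bytes.
        __m128i vs = _mm_adds_epu8(va, vb);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), vs);
    }
#endif
    add_sat_u8_scalar(a + i, b + i, dst + i, n - i);
}
```

How close something like this gets to hand-written assembly depends on the function and the compiler, so treat it as an illustration of the approach rather than a benchmark claim.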


astrange 856 days ago

The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.

The other ones aren't in asm merely because video codecs are "performance sensitive". It's because they run in such specific contexts that some optimizations which work there can't be expressed portably in C+intrinsics across the supported compilers and OSes.


Const-me 856 days ago

Yeah, it’s clear why you can’t have a single optimized C version.

However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?

SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.
