| HN Mirror


	by jamesaguilar 4845 days ago
	I've heard that there are C libraries for things like SSE2. I assume the same is true of AVX2. If this is so, why do you write so much of x264 in assembly? Do you find that there are significant gains versus C code that uses SIMD libraries? Have I been misled that C is nearly as fast as assembly 99% of the time? Note: I'm not trying to question your engineering chops, just trying to correct my own misconceptions.

DarkShikari 4845 days ago

"C libraries for things like SSE2"? Do you mean math libraries that have SIMD implementations of various functions that are callable from C? This here is effectively writing those libraries; they don't exist until we write the code.

jamesaguilar 4845 days ago

I'm talking about something like this: http://sseplus.sourceforge.net/fntable.html

I'm not an SIMD expert, but it seems like this implements similar primitives to those that are available to assembly (and not C). My question is basically whether the algorithms you're talking about could be implemented with these primitives. Although I guess no such library yet exists for AVX2.

DarkShikari 4845 days ago

Intrinsics aren't really C; they work in a C-like syntax, but you're still doing the exact same thing as assembly: you still have to write out every instruction you want to use, so you're not really saving any effort compared to just skipping the middleman.

In return, you are stuck with an extremely ugly syntax and a much less functional preprocessor, with the added bonus of a compiler that mangles your code.
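
To make the "one intrinsic per instruction" point concrete, here is a minimal sketch in C of a 16-byte sum of absolute differences, once as a plain loop and once with SSE2 intrinsics. It is an illustration only, not code from x264: the function names `sad16_scalar` and `sad16_simd` are hypothetical, and the instruction named in each comment is the one the intrinsic nominally corresponds to; the compiler is free to reorder or substitute them. On targets without SSE2 the sketch falls back to the scalar loop.

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Plain C reference: sum of absolute differences over 16 bytes. */
static unsigned sad16_scalar(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Same computation with intrinsics (hypothetical helper, not x264 code).
 * Every line still names one specific operation, as in assembly. */
static unsigned sad16_simd(const uint8_t *a, const uint8_t *b)
{
#if defined(__SSE2__)
    __m128i va  = _mm_loadu_si128((const __m128i *)a); /* movdqu: unaligned load */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b); /* movdqu: unaligned load */
    __m128i sad = _mm_sad_epu8(va, vb);                /* psadbw: one partial sum per 64-bit half */
    __m128i hi  = _mm_srli_si128(sad, 8);              /* psrldq: shift upper half down by 8 bytes */
    __m128i sum = _mm_add_epi32(sad, hi);              /* paddd: add the two partial sums */
    return (unsigned)_mm_cvtsi128_si32(sum);           /* movd: extract the low 32 bits */
#else
    return sad16_scalar(a, b);                         /* no SSE2: fall back to the loop */
#endif
}
```

Both functions take pointers to 16 readable bytes and should return the same value for any input; comparing the compiler's assembly output for `sad16_simd` against the comments is one way to see how much it rearranged.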

Scaevolus 4845 days ago

In terms of mangling, it reorders your vector operations, which can drastically hurt performance.

Do any production compilers schedule instructions to maximize superscalar performance?

kevinnk 4845 days ago

Um, unless I misunderstand your question, virtually all of them do. In particular, GCC, Clang/LLVM and ICC all do instruction scheduling.

jedbrown 4845 days ago

With intrinsics, you don't have to think about register naming. You still might count registers to avoid spills (and check the assembly to make sure), but there is less of a mental context switch than writing straight assembly.

DarkShikari 4845 days ago

I almost never spend more than a few seconds considering register allocation/naming when writing assembly (part of this is because x264's abstraction layer lets macros swap their arguments, so you don't have to track "what happens to be in xmm0 right now" mentally). In some rare cases it can get tricky when you start pushing up against the register cap, but that's exactly the case where the compiler tends to do terribly, and you'd want to do it yourself.

The pain of not having a proper macro assembler in C intrinsics is orders of magnitude worse than having to do my own register allocation in yasm, so for now, yasm is the lesser of two evils.

nitrogen 4845 days ago

Is there any hope of a compiler ever coming close to the level of optimization you can get from hand-coded assembly language? The numbers in your table routinely exceeded 10x gains over straight C. What's the compiler doing that's taking so long? Is it not able to vectorize at all?

_ihaque 4845 days ago

(I guess DarkShikari's comment is nested too deeply for me to reply directly.)

In my (admittedly limited) experience [1], the compiler has actually done pretty decently at optimizing register allocation in intrinsic-heavy loops. I wrote out the assembly loop in [2] with manual allocation into all 16 XMMs and then noticed the compiler managed to optimize 1 of them out.

[1] https://github.com/simtk/IRMSD

[2] https://github.com/SimTk/IRMSD/blob/master/python/IRMSD/theo...

jamesaguilar 4845 days ago

You can always reply directly by clicking on the link above a person's comment. Thanks for the interesting discussion on this one, guys.

pjmlp 4845 days ago

> I've heard that there are C libraries for things like SSE2...

Those are not C code, but rather inline assembly or compiler intrinsics, neither of which has anything to do with C.
