| HN Mirror



	by Osiris 4801 days ago
	Are there binary builds available with AVX2 support compiled in for testing? I'm curious whether the FMA3/FMA4 support available in AMD processors would improve performance. A quick Google search shows that there are some patches available for FMA support.

2 comments

DarkShikari 4801 days ago

I only pushed the code a few minutes ago, but binaries should probably be up at http://x264.nl/ relatively soonish (it's not my site though, so I wouldn't know exactly).

If you want to test without a physical Haswell, the Intel Software Development Emulator should work okay, albeit somewhat slowly. I'd post overall numbers for real Haswells, but Intel has apparently said we can't do that yet.

Regarding FMA, FMA3/4 are floating point only. Since x264 has just one floating point assembly function, only two FMA3/FMA4 instructions get used in all of x264 (not counting duplicates from different-architecture versions of the function). An FMA4 version has been included for a while; the new AVX2 version does include FMA3, but of course that won't run on AMD CPUs (yet).

XOP had some integer FMA instructions, but I generally didn't find them that useful (there are a few places where I found they could be slotted in, though).

jamesaguilar 4801 days ago

I've heard that there are C libraries for things like SSE2. I assume the same is true of AVX2. If so, why do you write so much of x264 in assembly? Do you find that there are significant gains versus C code that uses SIMD libraries? Have I been misled that C is nearly as fast as assembly 99% of the time?

Note: I'm not trying to question your engineering chops, just trying to correct my own misconceptions.

DarkShikari 4801 days ago

"C libraries for things like SSE2"? Do you mean math libraries that have SIMD implementations of various functions that are callable from C? This here is effectively writing those libraries; they don't exist until we write the code.

jamesaguilar 4801 days ago

I'm talking about something like this: http://sseplus.sourceforge.net/fntable.html

I'm not a SIMD expert, but it seems like this implements primitives similar to those that are available in assembly (and not in C). My question is basically whether the algorithms you're talking about could be implemented with these primitives. Although I guess no such library exists for AVX2 yet.

DarkShikari 4801 days ago

Intrinsics aren't really C; they work in a C-like syntax, but you're still doing the exact same thing as assembly: you still have to write out every instruction you want to use, so you're not really saving any effort compared to just skipping the middleman.

In return, you are stuck with an extremely ugly syntax and a much less functional preprocessor, with the added bonus of a compiler that mangles your code.
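
To make the "every instruction written out" point concrete, here is a minimal sketch of a 16-byte sum of absolute differences written twice: once as plain C and once with SSE2 intrinsics. It is not taken from x264; the function names `sad16_c` and `sad16_sse2` are hypothetical, and the intrinsic path is only compiled when the compiler defines `__SSE2__` (otherwise it falls back to the C loop).

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Plain C: the compiler decides which instructions to emit. */
unsigned sad16_c(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* Intrinsics: each call names one specific instruction (shown in the
   comments), so the author still picks the instruction sequence by hand. */
unsigned sad16_sse2(const uint8_t *a, const uint8_t *b)
{
#if defined(__SSE2__)
    __m128i va  = _mm_loadu_si128((const __m128i *)a);  /* MOVDQU */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);  /* MOVDQU */
    /* PSADBW: one partial sum in the low bits of each 64-bit half */
    __m128i sad = _mm_sad_epu8(va, vb);
    __m128i hi  = _mm_srli_si128(sad, 8);               /* PSRLDQ */
    __m128i tot = _mm_add_epi32(sad, hi);               /* PADDD  */
    return (unsigned)_mm_cvtsi128_si32(tot);            /* MOVD   */
#else
    return sad16_c(a, b);  /* fallback when SSE2 is not available */
#endif
}
```

Both functions return the same value for the same inputs. The intrinsic version leaves register allocation and instruction scheduling to the compiler, which is the trade-off debated in the replies.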

Scaevolus 4801 days ago

In terms of mangling, it reorders your vector operations, which can drastically hurt performance.

Do any production compilers schedule instructions to maximize superscalar performance?

jedbrown 4801 days ago

With intrinsics, you don't have to think about register naming. You still might count registers to avoid spills (and check the assembly to make sure), but there is less of a mental context switch than writing straight assembly.

pjmlp 4801 days ago

> I've heard that there are C libraries for things like SSE2...

Those are not C code, but rather inline assembly or compiler intrinsics, neither of which has anything to do with C.

ajross 4801 days ago

You want to test software on a device that doesn't exist in the market yet, but you don't want to build it yourself? The time you'll spend figuring out whatever emulator you're going to use is far longer than the time it takes to build x264 from a developer branch...

DarkShikari 4801 days ago

In all fairness, Intel's emulator is incredibly easy to use; it literally works like this:

sde -- ./myprogram myargs

instead of

./myprogram myargs

There's also probably a decent number of people at this point who have prerelease CPUs; they tend to breed quite explosively in the month or two before the official release.
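
For anyone wondering what a program run under the emulator might do with the emulated CPU, here is a rough sketch of runtime feature detection in C. It assumes GCC or Clang on x86 (it uses the `__builtin_cpu_supports` builtin) and assumes the emulator reports the emulated CPU's feature flags through CPUID. The names `have_avx2` and `pick_kernel` are hypothetical, and this is not how x264 itself detects CPU features.

```c
#include <stddef.h>

/* Returns 1 if the CPU the program sees (real or emulated) reports AVX2,
   0 otherwise. On other compilers or architectures it reports 0. */
int have_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* harmless if already initialized */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: returns the name of the code path to use. */
const char *pick_kernel(void)
{
    return have_avx2() ? "avx2" : "baseline";
}
```

A program built this way could take the "avx2" path when launched through the emulator and the "baseline" path when launched directly on older hardware, which is why prefixing the command is enough to exercise new code.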
