
by reitzensteinm 3616 days ago

The lowest-priced Xeon Phi in this generation is $2,348 (1.3 GHz, 64 cores). I can't help but feel Intel would do well to introduce an enthusiast product into the lineup. Even 1.0 GHz, 48 cores for $1000.

They're Tesla priced without an equivalent desktop gamer graphics card, and that means you can't just dip your toe into the water; you've got to buy the canoe up front.

Programming on a normal x86 doesn't really count, because there's no way to get a feel for what is fast and slow when you're using a monster of a core capable of running your poor code more quickly than it deserves.

6 comments

nkurz 3616 days ago

I agree with you completely. One other thing that I think Intel could/should do is to cooperate with one of the major cloud providers to offer reasonably priced by-the-hour remote access.

There is one wonderful opportunity, though, that deserves to be better known. Intel has sponsored Colfax Research to offer free online introductory courses, which include two weeks of remote access. The next session begins August 29th: http://colfaxresearch.com/how-16-08/

(I'm unaffiliated, but enjoyed the course a few months ago.)

dibanez 3616 days ago

As someone who has programmed both Phis and conventional x86 CPUs, I can confirm that the Phi is more sensitive to data traversal order and NUMA effects on which core accesses which memory. Also, the latest generation (Knights Landing) has much better performing cores than the previous generation.
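A minimal sketch of why traversal order matters, assuming a row-major array of 8-byte values and 64-byte cache lines. This is a toy counting model, not a measurement and not the commenter's code; `line_switches`, `ROWS`, and `COLS` are hypothetical names chosen for illustration.

```python
# Toy model: count how often a traversal moves to a different cache line.
# Assumptions (not from the thread): 64-byte lines, 8-byte elements,
# row-major storage, and a 256 x 256 array.

LINE_BYTES = 64
ELEM_BYTES = 8


def line_switches(rows, cols, row_major_order):
    """Count cache-line changes while visiting every element exactly once."""
    if row_major_order:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))
    switches = 0
    last_line = None
    for r, c in order:
        address = (r * cols + c) * ELEM_BYTES  # byte offset in row-major storage
        line = address // LINE_BYTES
        if line != last_line:
            switches += 1
            last_line = line
    return switches


ROWS, COLS = 256, 256
along_rows = line_switches(ROWS, COLS, row_major_order=True)
along_cols = line_switches(ROWS, COLS, row_major_order=False)
print(along_rows, along_cols)
```

In this model the column-wise walk touches a new line on every access, eight times as often as the row-wise walk. A large out-of-order core can hide some of that cost; a small core hides less.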

reitzensteinm 3616 days ago

Well, if it weren't the case and you had that core count without compromise, Phis would have come along a lot sooner with a price tag to match :)

Did you happen to use the Knights Corner or the new Knights Landing variant? I'd be quite interested to know how KNL stacks up, as naively from the specs it seems like it should be a lot more tolerant of unoptimized code (but not of poor memory access patterns).

dibanez 3616 days ago

Both, and it agrees with your prediction. KNL's cores are each much faster than KNC's cores. A KNC core was over 10X slower than a mainstream CPU core, and a KNL core seems to only be about 4.5X slower (on my particular code). I also get linear OpenMP scaling from 1 to 64 threads on KNL, so the parallelism is all there.
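As a rough back-of-the-envelope reading of those numbers, assuming the linear scaling holds across all 64 cores and that the 4.5X figure applies uniformly (both are simplifications; the function name is hypothetical):

```python
def mainstream_core_equivalents(cores, per_core_slowdown):
    """Aggregate throughput expressed in 'mainstream CPU cores',
    assuming perfectly linear scaling across all cores."""
    return cores / per_core_slowdown


# Figures quoted in the comment above: 64 threads, about 4.5X slower per core.
knl_equivalents = mainstream_core_equivalents(64, 4.5)

# Per-core improvement from KNC (taken as 10X slower, the lower bound quoted) to KNL.
per_core_gain = 10 / 4.5

print(round(knl_equivalents, 1), round(per_core_gain, 1))
```

On these assumptions a 64-core KNL behaves like roughly 14 mainstream cores for this particular code, and each KNL core is at least about 2.2 times faster than a KNC core.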

rjtobin 3616 days ago

Some questions out of curiosity: Is your application bandwidth-bound / compute-bound or something else? Also what modes have you been operating the KNL chip in?

dman 3616 days ago

Are you using the socketed version or the PCIe version?

dman 3616 days ago

I agree - but I think there appears to be a more significant shift underpinning this. I suspect that we are beginning to see an architectural divergence between server and client.

This reverses the last 20 years, in which Intel made inroads into the datacenter and there were few fundamental differences between Xeons and their desktop brethren (the i5/i7 etc.). Intel will have vastly different ISAs on server and client this coming generation (desktop is not getting AVX-512). I suspect the storage layer will bifurcate as well, since it's unclear whether clients will see much benefit from things like XPoint. In short, on the client side the only questions that seem to matter of late are: will the hardware change improve battery life, will it enable thinner form factors, and will it make a browser run measurably faster? I will watch with great interest how Intel pushes adoption of hardware features on the client going forward.

Klinky 3616 days ago

It is rather silly for Phi to be positioned as "it's just like x86, oh wait, except for needing to use special SIMD instructions to get max performance". Kind of like Atom being x86 for ultra-mobile platforms, just not being able to match the power/performance of ARM. Once you start sacrificing things to maintain x86 compatibility, you really lose its benefits.

slededit 3614 days ago

You really do need these changes to get max parallelism though. Where it shines is in situations where you'd otherwise be porting to a GPU. On the Phi it's a recompile plus adding a few intrinsics to your inner loops. This is much faster than getting reasonable performance on a heterogeneous architecture, and you don't have to micro-manage the slow PCIe link between the CPU and the GPU.

yvdriess 3615 days ago

Programming a modern Xeon x86 does count. Modernising your software for a Haswell/Skylake server Xeon (the ISA has been made public) also modernises it for the new Xeon Phi. You have a nearly identical ISA and programming model. In other words, modernising your code to scale well on a 16C/2P Xeon system is essentially dipping your toe in the water for a full-blown KNL Xeon Phi.

PS. For pricing, take into account that the new generation Xeon Phis are bootable; you do not need a host CPU to babysit them, as you do with a Tesla.

lqdc13 3616 days ago

Why not just use Xeon 2697-v2 for the same price as the phi?

It's 12 cores, so performance in an all-core situation would be about the same as this one. But on non-parallelized code it would be ~5x faster.
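The trade-off can be sketched with an Amdahl's-law style model. The model itself is an assumption added here for illustration, not something the commenter stated; the ~5x per-core gap and the 12 and 64 core counts come from the thread, and the function names are hypothetical.

```python
def runtime(serial_fraction, cores, per_core_slowdown):
    """Runtime relative to one mainstream core running the whole job.
    The serial part runs on one core, the rest is split evenly across cores."""
    parallel_fraction = 1.0 - serial_fraction
    return per_core_slowdown * (serial_fraction + parallel_fraction / cores)


def phi_vs_xeon(serial_fraction, phi_cores=64, phi_slowdown=5.0, xeon_cores=12):
    """Ratio of Phi runtime to Xeon runtime; values above 1 mean the Xeon wins."""
    phi = runtime(serial_fraction, phi_cores, phi_slowdown)
    xeon = runtime(serial_fraction, xeon_cores, 1.0)
    return phi / xeon


for serial_fraction in (0.0, 0.05, 1.0):
    print(serial_fraction, round(phi_vs_xeon(serial_fraction), 2))
```

With no serial work the two come out nearly even under these assumptions, which matches the "about the same" estimate. Even a 5% serial portion tips the model clearly toward the 12-core Xeon. The model ignores memory bandwidth entirely.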

profquail 3616 days ago

Memory bandwidth is important too. The Knights Landing processors have 16GB of on-chip memory, so the cores get significantly higher bandwidth than you'd get with DDR4; the additional memory bandwidth makes more of an impact on the runtime of some algorithms than raw compute performance does.
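One way to picture this is a roofline-style estimate: attainable throughput is capped by either peak compute or bandwidth times arithmetic intensity. The sketch below is an illustration only; the peak and bandwidth values are placeholder assumptions, not figures from the thread or from vendor specifications.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: the lower of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Placeholder numbers chosen only to show the shape of the effect.
PEAK_GFLOPS = 3000.0
SLOW_MEMORY_GB_S = 90.0    # stand-in for off-package DRAM
FAST_MEMORY_GB_S = 400.0   # stand-in for the high-bandwidth memory

for flops_per_byte in (0.25, 1.0, 50.0):
    slow = attainable_gflops(PEAK_GFLOPS, SLOW_MEMORY_GB_S, flops_per_byte)
    fast = attainable_gflops(PEAK_GFLOPS, FAST_MEMORY_GB_S, flops_per_byte)
    print(flops_per_byte, slow, fast)
```

For low-intensity kernels the bound scales directly with bandwidth, so faster memory helps far more than extra compute would. For high-intensity kernels both configurations hit the same compute ceiling.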

fulafel 3616 days ago

The optional 16GB L3 is on separate chips, but it's colocated inside the same chip package. MCMs (multi-chip modules) of this kind have been used in the semiconductor industry since the 70s. Recent examples include AMD Xenos in the Xbox 360, the Wii U CPU, and IBM POWER chips.

yvdriess 3615 days ago

Nope. First, it's not L3 cache; second, comparing 3D-stacked in-package memory (MCDRAM, HBM, HBM2) with your examples is misleading. https://en.wikipedia.org/wiki/High_Bandwidth_Memory

fulafel 3615 days ago

You can configure the near memory to be used as cache or directly addressed memory as desired. Users of existing codes will configure it as cache.

yvdriess 3608 days ago

Direct addressing is the preferred configuration. Only if your existing code's working set does not fit in MCDRAM does the cache configuration make sense.

It might sound pedantic on my part, but 'it can act as cache' is very different in practice from 'It is a cache'.
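The rule of thumb stated above can be written down as a tiny sketch. The function name and return strings are hypothetical, the 16GB capacity is the figure quoted earlier in the thread, and real tuning would involve more than working-set size.

```python
MCDRAM_GB = 16  # capacity quoted earlier in the thread


def suggested_mcdram_mode(working_set_gb, mcdram_gb=MCDRAM_GB):
    """Apply the rule of thumb: address the memory directly when the
    working set fits, otherwise fall back to using it as a cache."""
    if working_set_gb <= mcdram_gb:
        return "direct"
    return "cache"


print(suggested_mcdram_mode(12))   # fits in MCDRAM
print(suggested_mcdram_mode(48))   # does not fit
```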

segmondy 3616 days ago

Last year they sold a bunch of the 60-core ones for $200. I got one; the problem is that they run hot and need a server that can support them. I have yet to acquire a server with BAR support, so it's still sitting. :-( Anyway, they are out there for a decent price; keep your eyes open and you will find a deal.

nkurz 3616 days ago

More than just "run hot", they were the "passive" models that require external cooling. You might be interested in these 3D printed designs for the cooling:

http://www.thingiverse.com/thing:997213

http://ssrb.github.io/hpc/2015/04/17/cooling-down-the-xeon-p...

As you mention, you'd still need a motherboard with 64-bit Base Address Register support, but at least you could keep it from burning up (or more likely, shutting down when it overheats).

segmondy 3615 days ago

Nice, thanks, I'll check those out. My plan was to run it only during the winter months outside, with massive fans.