Hacker News
by throwawayish 3392 days ago
I think Naples is a very exciting development, because:

- 1S/2S is obviously where the pie is. Few servers are 4S.

- 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas

- First x86 server platform with SHA1/2 acceleration

- 128 PCIe lanes in a 1S system is unprecedented

All in all Naples seems like a very interesting platform for throughput-intensive applications. Overall it seems that Sun with its Niagara approach (massive number of threads, lots of I/O on-chip) was just a few years too early (and likely a few thousand / system too expensive ;)

8 comments

> 128 PCIe lanes in a 1S system is unprecedented

Yes, definitely drooling at this. Assuming a workload that doesn't eat too much CPU, this would make for a relatively cheap and hassle-free non-blocking 8 GPU @ 16x PCIe workstation. I wants one.

That does sound pretty spectacular, and really loud. What kind of case would you put that in? Would you work with ear protection?
Wrt. noise: decibels (dB, "perceived loudness") are logarithmic in sound energy, so going from e.g. a single GTX 1080 at 47 dB to 8x GTX 1080 only increases the noise to 56 dB, which is noticeable but not really annoying, and very far from requiring ear protection. The recommendation for office spaces is that noise be kept < 60 dB IIRC.
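
For anyone who wants to check the 47 dB to 56 dB figure, here is a minimal sketch. It assumes the GPUs behave as identical, incoherent noise sources, so that their sound powers simply add; the function name is made up for illustration.

```python
import math

def combined_level_db(single_source_db, n_sources):
    """Level of n identical, incoherent sources: L + 10*log10(n).

    ASSUMPTION: sound powers add linearly, which is the usual
    approximation for independent fans, not a measured result.
    """
    return single_source_db + 10.0 * math.log10(n_sources)

one_gpu_db = 47.0  # figure quoted in the comment above
eight_gpus_db = combined_level_db(one_gpu_db, 8)
print(f"{eight_gpus_db:.1f} dB")
```

Eight sources add 10*log10(8), roughly 9 dB, on top of the single-card level, which is where the 56 dB comes from.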

Wrt. cases: I think a regular E-ATX compatible case should be enough, but it all depends on the motherboard, and those don't exist yet. Existing 8x GPU servers have been 4U rack mount dual socket affairs; you can also already get 7x GPU dual socket "EEB" motherboards and workstation style cases, but none that will do full 16x for all the GPUs.

Your noise scale is off. 60dB is restaurant conversation level noise.

For comparison, Notebookcheck's system noise scale is 30dB=silent, 40dB=audible, 50dB=loud.

Yup, sorry, my bad. 60 dB is definitely loud enough to be annoying. Still, 8 GPUs are not 8x louder than one, which was the main point.
That's not the point. 80 dB isn't twice as loud as 40 dB, either, it's much more than that.
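
A rough sketch of the two scales being argued about here. The physical part (10 dB per factor of 10 in sound power) is the definition of the decibel; the perceptual part uses the common rule of thumb that +10 dB sounds about twice as loud, which is an assumption and not something from this thread.

```python
def sound_power_ratio(level_a_db, level_b_db):
    """Ratio of sound power between two levels (definition of the dB)."""
    return 10 ** ((level_a_db - level_b_db) / 10)

def perceived_loudness_ratio(level_a_db, level_b_db):
    """ASSUMPTION: rule of thumb that +10 dB is perceived as ~2x louder."""
    return 2 ** ((level_a_db - level_b_db) / 10)

print(sound_power_ratio(80, 40))         # physical power ratio
print(perceived_loudness_ratio(80, 40))  # rule-of-thumb loudness ratio
```

Under those assumptions 80 dB carries 10,000 times the sound power of 40 dB and is perceived as roughly 16 times as loud, so "much more than twice" holds on either scale.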
> Recommendations for office spaces is that noise be kept < 60 dB IIRC.

I'd quit if I had to work in an office space with 60 dB noise. That's like sitting next to a rack of 1U servers at "moderately angry bee swarm" fan level.

I personally cannot stand to be near a noise source above 40 dB for any extended length of time (more than a few hours).

But 60 dB... wow. Can't imagine how shitty that must be to work in for 8 hours per day.

Seriously. Here in my country (Spain) the recommended maximum level for office spaces is 45 dB(A) of equivalent continuous sound level (NBE-CA-88 regulation, only in Spanish: http://www.ual.es/Depar/proyectosingenieria/descargas/Normas...)
> 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas

This one will be interesting. The current Ryzen (like most of the Intel desktop range) has two channels, but everyone has been benchmarking it against the i7-6900K because they both have eight cores. The i7-6900K is the workstation LGA 2011 with four channels. If the workstation Ryzen will have eight channels...

Let's hope this isn't Niagara again: it needs to have decent clock speeds, as IPC is still worth something today. But yes, I totally agree, this is an exciting chip.
It's not. Not only did AMD move away from the CMT (clustered multithreading) design used in the previous Bulldozer microarchitecture, they now have an SMT (simultaneous multithreading) architecture allowing for 2 threads per core.

By comparison, the performance of SPARC substantially improved moving from the T1 and T2 to the T3 and later. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.

Starting with the T3, two of the threads could be executed simultaneously. Then, starting with the T4, SPARC added dynamic threading and out-of-order execution. Later versions are even faster and clock speeds have also risen considerably.

I didn't know about this. Are there benchmarks that aren't canned by Oracle that you know of? I'm intrigued by this round-robin way of threading. I'm not a CPU expert, but how does this compare with the Power arch's way of threading?
Think of it this way, the original Niagara (T1) was an in-order CPU. That is, instructions were executed in the order they occur in the program code. This is simple and power efficient but doesn't produce very good single thread performance, since the processor stalls if an instruction takes longer than expected. Say, a load instruction misses L1 cache and has to fetch the data from L2/L3/Lwhatever/memory. Now, one way to drive up the utilization of the CPU core is to add hardware threads. And the simplest way to do that? Well, just run an instruction from another available thread every cycle (that is, if a thread is blocked e.g. waiting for memory, skip it). So now you have a CPU that is still pretty small, simple and power efficient, but can still exploit memory level parallelism (i.e. have multiple outstanding memory ops in flight).
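
The "issue from the next thread that isn't stalled" idea can be sketched as a toy simulation. Everything here is a made-up model for illustration (one issue slot per cycle, every 4th instruction of a thread is a load that misses cache and blocks that thread for 20 cycles); it is not how the T1 is actually parameterized.

```python
def simulate_barrel_core(n_threads, cycles=10_000, miss_every=4, miss_latency=20):
    """Return the fraction of cycles in which the core issued an instruction.

    ASSUMPTIONS (hypothetical model): in-order threads, one issue slot
    per cycle, and every `miss_every`-th instruction of a thread is a
    load that stalls that thread for `miss_latency` cycles.
    """
    ready_at = [0] * n_threads      # cycle at which each thread may issue again
    issued_count = [0] * n_threads  # instructions issued so far, per thread
    next_thread = 0                 # round-robin pointer
    busy_cycles = 0
    for cycle in range(cycles):
        # Scan threads in round-robin order, skipping any that are stalled.
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready_at[t] <= cycle:
                issued_count[t] += 1
                if issued_count[t] % miss_every == 0:
                    ready_at[t] = cycle + miss_latency  # cache miss: thread blocks
                next_thread = (t + 1) % n_threads
                busy_cycles += 1
                break
    return busy_cycles / cycles

for n in (1, 4, 8, 32):
    print(n, "threads ->", round(simulate_barrel_core(n), 3), "utilization")
```

With these invented parameters a single thread keeps the core busy well under a quarter of the time, and utilization climbs as hardware threads are added, which is the whole point of the design: the misses of different threads overlap.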

Now, the other approach is to have a CPU with out-of-order (OoO) execution, meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. And then later on a bunch of magic happens so that externally to the CPU it still looks like everything was executed in order like the program code specified. This is pretty good for getting good single-thread performance, and can exploit some amount of MLP as well, e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and by the length of the instruction queues that the scheduler maintains. The downside of this is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But, if you want good single-thread performance, that's the price you have to pay.

Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, now that you already have all this scheduling logic, it turns out it's relatively easy. Just "tag" each instruction with a thread ID, and let the scheduler sort it all out. This is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem. This SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
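
To make the "any instruction whose dependencies are satisfied can issue" part concrete, here is a toy scheduler sketch. The program, the latencies and the one-issue-per-cycle limit are all invented for illustration; a window of 1 behaves like an in-order core, a larger window like a (very simplified) OoO core.

```python
def cycles_to_finish(program, window):
    """Cycle at which the last result is ready.

    `program` is a list of (dependency_indices, latency) tuples in
    program order. Each cycle, the oldest ready instruction among the
    `window` oldest un-issued ones is issued (hypothetical toy model).
    """
    done_at = {}                        # instruction index -> cycle result is ready
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        for idx in pending[:window]:
            deps, latency = program[idx]
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[idx] = cycle + latency
                pending.remove(idx)
                break                   # at most one issue per cycle
        cycle += 1
    return max(done_at.values())

# Two independent "load that misses cache, then use the value" pairs.
program = [
    ([], 20),   # 0: load A (assumed 20-cycle miss)
    ([0], 1),   # 1: add using A
    ([], 20),   # 2: load B (assumed 20-cycle miss)
    ([2], 1),   # 3: add using B
]
print("in-order:", cycles_to_finish(program, window=1))
print("OoO     :", cycles_to_finish(program, window=4))
```

The in-order run serializes the two misses, while the wider window lets load B start while load A is still outstanding. That overlap is the memory-level parallelism described above, and it is limited by the window size and by the dependencies in the code.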

As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some SPEC CPU results for Niagara.

Doing that now. Thanks for the write-up.

So, although the two are separated by time but not by clocks (the Intel setup has roughly the same base clocks and the same RAM as the T4 setup), the 40-thread Xeon system had roughly double the performance of the 128-thread T4 setup running SPECjvm2008: https://www.spec.org/jvm2008/results/jvm2008.html

The T7 and S7 are even faster than the T4, and unfortunately I haven't seen newer results published for them.
Naples is based on Ryzen which, if you look at early benchmarks, is beating the competition on all fronts except gaming (suspected to be due to software optimisation and motherboard issues).
Yes, but four Ryzen modules combined into this beastly Naples chip aren't going to be clocked at the same frequencies. The top-end Intel chips have TDPs of 165W, but four Ryzen chips at 3.6GHz have a TDP of 65W apiece, and you're not going to see a 260W server chip if you want to sell into the datacenter.
The Ryzen 7 1800X has a TDP of 95W and beats the 140W Intel i7-6900K by 4% in performance. They've made some huge jumps in power efficiency.
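
Taking the figures in this comment at face value (95W vs 140W TDP, 4% faster), the implied efficiency gap can be worked out as below. This is only arithmetic on the quoted numbers; it treats TDP as if it were power draw, which is an assumption and not a measurement.

```python
def perf_per_watt_ratio(perf_a, watts_a, perf_b, watts_b):
    """How many times better A's performance per watt is than B's."""
    return (perf_a / watts_a) / (perf_b / watts_b)

# Numbers quoted in the comment; the 6900K is normalized to performance 1.00.
ratio = perf_per_watt_ratio(perf_a=1.04, watts_a=95.0, perf_b=1.00, watts_b=140.0)
print(f"1800X perf/W is about {ratio:.2f}x the 6900K's, if TDP equals power draw")
```

If the chips actually draw something other than their rated TDP under load, the ratio shifts accordingly.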

I don't know if AMD will make a new architecture or not, but I can't see why they wouldn't just release 32 Ryzen cores side-by-side and underclocked at the stock configuration.

The 1800X will use 130W+ in the same scenarios as the 6900K. AMD just seems to be defining TDP differently.
The thermal design power is the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate in typical operation. TDP =/= power consumed
> TDP =/= power consumed

Where do you think the heat comes from? Or where do you think the power that doesn't turn into heat goes?

As far as TDP goes, the transference of electrical energy to heat is equivalent to that of a space heater, as Puget Systems demonstrated here: https://www.pugetsystems.com/labs/articles/Gaming-PC-vs-Spac...
But TDP is also a function of power consumption, directly proportional. So comparing the TDP of processors of similar fabrication should tell you about comparative power consumption, if not the exact difference.
I've seen benchmarks showing the 95W number is very unrealistic and that it actually uses more, like the Intel processor, under certain loads.
Wow, that's pretty insane. Underclocked to 3.3GHz it runs at ~42 watts and is benching at ~178.5% performance per watt vs stock clock. This CPU will be very interesting to see in the datacenter space.
They will likely drop the clock. I don't think the market cares at all about how much power a CPU takes. If a 1U box has competitive performance AND better performance/watt, then it's attractive. If it has worse performance/watt, then it's not.

AMD might well steal some of the dual socket market with a single socket, and maybe some of the quad socket market with dual sockets.

Considering that the current Ryzen at $500 is relatively competitive with the $1,000 Intel (basically a relabeled Xeon with 4 memory buses in the LGA2011 server socket), a quad module (32 core/64 thread) in a socket sounds pretty good. Even if it's more watts than the Intel.

The E7s go to 160W each. If you can drive better than 1.7x the performance and stay within the maximum thermal output per physical volume, I see no reason why not to use this.

One reason, perhaps, is if my binaries are compiled with Intel-specific optimizations and it's inconvenient to deploy separate AMD-optimized binaries.

I can see a use case for it, as long as it delivers on performance. No one minds high TDP, as long as it offers a performance advantage. Hell, some servers have 4-8 Titans in them, and no one is complaining about their TDP. If a 260W CPU TDP is justified by the performance, no one will care.
The bigger E7s scratch the 200 W mark pretty hard and IBM already had POWER chips go beyond 200 W. However, cooling and power density are ... problematic. The same goes for accelerators. Supermicro will happily deliver you a 1U box with four Pascals and two Xeon sockets, but there is no datacenter in the world where you can stuff 42 of those in a cabinet. [Which doesn't mean that these don't make sense]

However, high end systems don't lend themselves well to mass-deployment (i.e. scale out).

Maybe for single server or academia setups, but in datacenters TDP and power consumption absolutely do matter.
And that was my point as well.
I'm more than willing to admit I could be wrong. And maybe Intel will push the TDP envelope with servers as well if Naples proves a threat when things all shake out. It's just that if Intel hasn't put a ~250W server chip into production, I doubt AMD will. Then again, if it performs that much better, there's a calculus there that will need to be done. My prediction, based on no evidence, is that this 32-core chip will be clocked at 2.6GHz and boost to 3.2. Shot in the dark, but given current TDPs that's where I think things might shake out.
32 cores in one socket may also take a bite out of some servers that are currently 2 sockets.
My shallow understanding of big servers and IBM Z series amounted to "lots of dedicated IO processors". Seems like "mainstream" caught up with big blue.
Sort of. It ebbs and flows. It's generally more maintainable to do more in CPU/kernel and less in HW/firmware for PCs, and of course price runs the market, so there's a race to do less. Part of the mainframe price tag is getting long-term support on the whole system stack, whereas PC vendors actively abandon stuff after a few years. That is a big risk for something like a TCP offload engine.

Every mainframe interface is basically an offload interface: "computers" DMAing and processing to the CPs and each other. Every I/O device has a command processor, so it can handle channel errors and integrated PCIe errors in a way PCs cannot.

A PC with Chelsio NICs doing TCP offload with direct data placement or RDMA, as well as Fibre Channel storage, would be mini/mainframe-ish.

Pretty much. Mainframes have been very I/O oriented from the start. Channel I/O (more or less DMA) with dedicated channel programs and processors can be very high-throughput.
Also, I suppose it frees the logic processors from all I/O-related (caching too?) processing and allows for fancier strategies downstream... (all a guess fest)
Intel doesn't have SHA2 acceleration? ARMv8 has had it for like 2-3 years now...

And AMD should dump SHA1 acceleration in the next generation.

>And AMD should dump SHA1 acceleration in the next generation.

The cost to have that on silicon is probably close to zero. If you think SHA1 is just going to magically disappear because you want it to, well, you'll be in for a SHA1 sized surprise. Our grandkids will still have SHA1 acceleration.

>ARMv8 has had it for like 2-3 years now...

Because ARM cores don't remotely have the CPU heft an Intel x86/64 chip has, so ARM needs all this acceleration because it's typically used in very low power mobile scenarios. On top of that, Intel claims AES-NI can be used to accelerate SHA1.

https://software.intel.com/en-us/articles/improving-the-perf...

Why should it be dropped? Isn't it just a hash function?
If you remove things from the instruction set, any code that uses them will either crash or run very slowly in emulation.

Most uses of special instructions will check feature bits or CPU version, but not all will do so correctly.

(I'd say that the additional area cost of something like this is small, and the big cost of special instructions is reserving opcodes and feature bits)
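
A minimal sketch of the "check feature bits before using the special instructions" pattern. It assumes Linux on x86, where `/proc/cpuinfo` lists the SHA extensions as the `sha_ni` flag; the helper names and the sample text are made up for illustration, and real crypto libraries typically query CPUID directly instead.

```python
def cpu_flags(cpuinfo_text):
    """Parse the first 'flags' line of /proc/cpuinfo-style text into a set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def pick_sha256_backend(flags):
    """Hypothetical dispatch: only use the accelerated path if advertised."""
    return "hardware" if "sha_ni" in flags else "software"

# Invented sample input; on a real Linux box you would read /proc/cpuinfo.
sample_with_sha = "processor\t: 0\nflags\t\t: fpu sse2 aes avx2 sha_ni\n"
sample_without_sha = "processor\t: 0\nflags\t\t: fpu sse2 aes avx2\n"

print(pick_sha256_backend(cpu_flags(sample_with_sha)))
print(pick_sha256_backend(cpu_flags(sample_without_sha)))
```

Code that skips this check and assumes the instructions exist is exactly what breaks when a later CPU drops them.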

Short story: because its role as a crypto hash function is sort of obsolete given that it's been proven to be broken, and faster, more secure alternatives exist.

But for all practical purposes, SHA1 isn't about to disappear. MD5 has been shown to be broken since forever and people still write new code using it today.

The thing with SHA-1 is that we know (and have known for a decade) that it is not a good cryptographic hash function. It is still, along with MD5, a good hash function if you control the input, i.e. in a hash table.
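
A small sketch of that non-cryptographic use: mapping keys to hash-table buckets with SHA-1 from Python's standard `hashlib`. The function name and bucket count are invented for illustration, and this only makes sense when the keys are not attacker-controlled.

```python
import hashlib

def bucket_index(key, n_buckets):
    """Map a bytes key to a bucket using the first 8 bytes of its SHA-1 digest."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# Spread some made-up keys over 16 buckets and count how many land in each.
counts = [0] * 16
for i in range(1000):
    counts[bucket_index(f"key-{i}".encode(), 16)] += 1
print(counts)
```

The digest is used purely for its even spread here, not for any security property.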
There are better functions than SHA1 to use for hash tables. Candid question: really what is the use for MD5/SHA1 these days?
ARM cores are much weaker; crypto performance without NEON is abysmal across the board. Of course, compared to hardware acceleration, software always seems slow; Haswell manages AES-OCB at <1 cpb.
As a side note, XOP had rotate instructions. Sadly it is no longer supported in Ryzen.
Intel has had SHA1/2 acceleration for YEARS via the AES-NI instruction set.

https://en.wikipedia.org/wiki/Intel_SHA_extensions

>There are seven new SSE-based instructions, four supporting SHA-1 and three for SHA-256:

>SHA1RNDS4, SHA1NEXTE, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, SHA256MSG2

This is not part of AES-NI and has never been released in a mid-range+ server/desktop CPU, only part of some Atom parts (Goldmont). Therefore software support is poor (I think OpenSSL does not support it). It is said to be included in 2018+ Cannonlake, though.
haha nope. This is not a part of AES-NI.

The only processors so far with these extensions are low power Goldmont chips.

https://github.com/weidai11/cryptopp/issues/139

Goldmont probably has them because it doesn't have the wide AVX pipelines necessary for fast software crypto.

Skylake can compute SHA1 at 3.4-4.3 cycles/B and SHA256 at 7-9 cycles/B [1]. That's ~1GB/s SHA1 and ~500MB/s SHA256.

1: https://bench.cr.yp.to/results-hash.html#amd64-skylake
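
For reference, the cycles-per-byte to throughput conversion behind those numbers, as a minimal sketch. The 4.0 GHz clock is an assumption chosen because it reproduces the rounded figures; the comment does not state which clock speed was used.

```python
def throughput_gb_per_s(cycles_per_byte, clock_ghz):
    """Bytes per second = cycles per second / cycles per byte (in GB/s)."""
    return clock_ghz / cycles_per_byte

ASSUMED_CLOCK_GHZ = 4.0  # assumption, not taken from the benchmark page

for name, cpb in (("SHA1", 3.4), ("SHA1", 4.3), ("SHA256", 7.0), ("SHA256", 9.0)):
    print(f"{name} at {cpb} cycles/B: {throughput_gb_per_s(cpb, ASSUMED_CLOCK_GHZ):.2f} GB/s")
```

At that assumed clock, SHA1 lands around 0.9 to 1.2 GB/s and SHA256 around 0.44 to 0.57 GB/s, which matches the "~1GB/s" and "~500MB/s" quoted above.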