by gigatexal 3396 days ago
Let's hope this isn't Niagara again: it needs to have decent clock speeds as IPC is still worth something today. But yes, I totally agree, this is an exciting chip.

It's not. Not only did AMD move away from the CMT (clustered multithreading) design used in the previous Bulldozer microarchitecture, they now have an SMT (simultaneous multithreading) architecture allowing for 2 threads per core.

By comparison, the performance of SPARC improved substantially moving from the T1 and T2 to the T3 and later. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.

Starting with the T3, two of the threads could be executed simultaneously. Then, starting with the T4, SPARC added dynamic threading and out-of-order execution. Later versions are even faster, and clock speeds have also risen considerably.

I didn't know about this. Are there benchmarks that aren't canned by Oracle that you know of? I'm intrigued by this round-robin way of threading. I'm not a CPU expert, but how does this compare with the POWER architecture's way of threading?
Think of it this way: the original Niagara (T1) was an in-order CPU. That is, instructions were executed in the order they occur in the program code. This is simple and power efficient but doesn't produce very good single-thread performance, since the processor stalls if an instruction takes longer than expected. Say a load instruction misses the L1 cache and has to fetch the data from L2/L3/Lwhatever/memory. Now, one way to drive up the utilization of the CPU core is to add hardware threads. And the simplest way to do that? Well, just run an instruction from another available thread every cycle (that is, if a thread is blocked, e.g. waiting for memory, skip it). So now you have a CPU that is still pretty small, simple and power efficient, but can still exploit memory-level parallelism (MLP, i.e. have multiple outstanding memory ops in flight).
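To make the round-robin idea concrete, here is a toy simulation. It is only a sketch under stated assumptions, not a model of the real T1 pipeline: the core issues one instruction per cycle, a thread blocks until its previous instruction has completed, and the names (`HwThread`, `run_barrel`, `make_threads`) and the 1-cycle/20-cycle latencies are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HwThread:
    latencies: List[int]   # per-instruction latency in cycles, in program order
    pc: int = 0            # index of the next instruction to issue
    ready_at: int = 0      # first cycle at which this thread may issue again

    def done(self) -> bool:
        return self.pc >= len(self.latencies)


def run_barrel(threads: List[HwThread]) -> Tuple[int, int]:
    """Single-issue, in-order core: each cycle, issue from the next ready thread.

    Returns (cycles elapsed until the last instruction issued, instructions issued).
    """
    cycle = issued = nxt = 0
    n = len(threads)
    while not all(t.done() for t in threads):
        for k in range(n):
            t = threads[(nxt + k) % n]
            if not t.done() and t.ready_at <= cycle:
                # The thread is blocked until this instruction's result is back.
                t.ready_at = cycle + t.latencies[t.pc]
                t.pc += 1
                issued += 1
                nxt = (nxt + k + 1) % n   # resume the rotation after this thread
                break
        cycle += 1   # if no thread was ready, the issue slot is simply wasted
    return cycle, issued


def make_threads(count: int) -> List[HwThread]:
    # Hypothetical workload: ALU op (1 cycle), cache miss (20 cycles), repeated twice.
    return [HwThread([1, 20, 1, 20]) for _ in range(count)]


if __name__ == "__main__":
    for count in (1, 4):
        cycles, issued = run_barrel(make_threads(count))
        print(count, "thread(s):", issued, "instructions in", cycles, "cycles")
```

In this toy model a single thread fills the issue slot in 4 of 23 cycles, while four threads fill it in 16 of 32, because the misses of different threads overlap.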

Now, the other approach is that you have a CPU with out-of-order (OoO) execution, meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. Later on a bunch of magic happens so that externally to the CPU it still looks like everything was executed in order, as the program code specified. This is pretty good for getting good single-thread performance, and can exploit some amount of memory-level parallelism (MLP) as well, e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and by the length of the instruction queues that the scheduler maintains.

The downside of this is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But if you want good single-thread performance, that's the price you have to pay.

Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, since you already have all this scheduling logic, it turns out to be relatively easy. Just "tag" each instruction with a thread ID, and let the scheduler sort it all out. This is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem. This SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
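The "tag each instruction with a thread ID" idea can be sketched roughly like this. It is a heavy simplification and everything in it is an assumption for illustration: the window is unbounded, each register is written only once (so no renaming is needed), there is no in-order retirement, every source must be produced by some instruction in the list, and `run_smt` and `chain` are hypothetical names.

```python
from typing import Dict, List, Tuple

# (thread id, destination register, source registers, latency in cycles)
Instr = Tuple[int, str, List[str], int]


def run_smt(instrs: List[Instr], width: int = 2) -> int:
    """Issue up to `width` ready instructions per cycle, from any thread.

    Returns the cycle at which the last result becomes available.
    """
    ready_time: Dict[Tuple[int, str], int] = {}   # (tid, reg) -> cycle value is ready
    pending = list(instrs)
    cycle = finish = 0
    while pending:
        issued = 0
        still_waiting = []
        for ins in pending:
            tid, dest, srcs, latency = ins
            # The thread tag keeps each thread's registers separate.
            deps_ok = all(ready_time.get((tid, s), float("inf")) <= cycle for s in srcs)
            if deps_ok and issued < width:
                ready_time[(tid, dest)] = cycle + latency
                finish = max(finish, cycle + latency)
                issued += 1
            else:
                still_waiting.append(ins)
        pending = still_waiting
        cycle += 1
    return finish


def chain(tid: int) -> List[Instr]:
    # Hypothetical program: a load that misses (10 cycles) feeding two dependent adds.
    return [(tid, "a", [], 10), (tid, "b", ["a"], 1), (tid, "c", ["b"], 1)]


if __name__ == "__main__":
    print("one thread: ", run_smt(chain(0)))
    print("two threads:", run_smt(chain(0) + chain(1)))
```

With these made-up numbers, one dependent chain takes 12 cycles on its own, and two such chains from different threads also finish in 12 cycles at issue width 2, since the second thread fills slots the first one leaves idle.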

As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some SPEC CPU results for Niagara.

Doing that now. Thanks for the write-up.

So, although the systems are separated by time, they aren't separated by clocks (the Intel setup has roughly the same base clocks and the same RAM as the T4 setup), and the 40-thread Xeon system had roughly double the perf of the 128-thread T4 setup running SPECjvm2008: https://www.spec.org/jvm2008/results/jvm2008.html

The T7 and S7 are even faster than the T4, and unfortunately I haven't seen newer results published for them.
Naples is based on Ryzen which, if you look at early benchmarks, is beating the competition on all fronts except gaming (suspected to be due to software optimisation and motherboard issues).
Yes, but four modules of Ryzen making up this beastly Naples chip aren't going to be clocked at the same frequencies. The top-end Intel chips have TDPs of 165W, but 4 Ryzen chips at 3.6 GHz have a TDP of 65W apiece, and you're not going to see a 260W server chip if you want to sell into the datacenter.
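The arithmetic behind that 260W figure, as a quick sketch. It assumes TDP simply adds up across dies, which is a simplification, and the constant names are made up for illustration; the numbers are the ones quoted in the comment.

```python
DIES = 4                  # Ryzen dies per Naples package, as described in the comment
TDP_PER_DIE_W = 65        # per-die TDP figure quoted in the comment
INTEL_TOP_TDP_W = 165     # top-end Intel server TDP quoted in the comment

# Naive package TDP if every die kept its desktop clocks and TDP.
naive_package_tdp_w = DIES * TDP_PER_DIE_W

# How far that overshoots the Intel figure.
over_intel_w = naive_package_tdp_w - INTEL_TOP_TDP_W

# Per-die budget if the package had to fit the same envelope as the Intel part.
per_die_budget_w = INTEL_TOP_TDP_W / DIES

if __name__ == "__main__":
    print(naive_package_tdp_w, over_intel_w, per_die_budget_w)
```

So under this naive model each die would need to drop from 65W to roughly 41W to match the 165W envelope, which is why lower clocks seem likely.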
The Ryzen 7 1800X has a TDP of 95W and beats the 140W Intel i7-6900K by 4% in performance. They've made some huge jumps in power efficiency.

I don't know if AMD will make a new architecture or not, but I can't see why they wouldn't just release 32 Ryzen cores side-by-side and underclocked at the stock configuration.

The 1800X will use 130W+ in the same scenarios as the 6900K. AMD just seems to be defining TDP differently.
The thermal design power is the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate in typical operation. TDP =/= power consumed
> TDP =/= power consumed

Where do you think the heat comes from? Or where do you think the power that doesn't turn into heat goes?

I think he is trying to argue that TDP is a figure about the cooling requirement during peak power usage. Actual power usage may or may not be less during typical workload.
As far as TDP goes, the conversion of electrical energy to heat is equivalent to that of a space heater, as Puget Systems demonstrated here: https://www.pugetsystems.com/labs/articles/Gaming-PC-vs-Spac...
But TDP is also a function of power consumption; the two are directly proportional. So comparing the TDPs of processors built on similar fabrication processes should tell you about comparative power consumption, if not the exact difference.
I've seen benchmarks showing the 95W number is very unrealistic and that it actually draws about as much as the Intel processor under certain loads.
Wow, that's pretty insane. Underclocked to 3.3 GHz it runs at ~42 watts and is benching at ~178.5% performance per watt vs stock clock. This CPU will be very interesting to see in the datacenter space.
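For a feel of what those numbers imply, here is a back-of-the-envelope sketch. The stock power draw is not given in the comment, so the 95W used here is an assumption (it is the TDP figure quoted elsewhere in the thread, which other comments dispute as actual draw); `implied_relative_performance` is a hypothetical helper name.

```python
def implied_relative_performance(perf_per_watt_ratio: float,
                                 underclocked_w: float,
                                 stock_w: float) -> float:
    """Performance of the underclocked chip relative to stock (1.0 = same).

    perf/watt ratio = (perf_uc / watts_uc) / (perf_stock / watts_stock),
    so perf_uc / perf_stock = ratio * watts_uc / watts_stock.
    """
    return perf_per_watt_ratio * underclocked_w / stock_w


STOCK_POWER_W = 95.0   # ASSUMPTION: stock draw taken to equal the quoted TDP

rel = implied_relative_performance(1.785, 42.0, STOCK_POWER_W)

if __name__ == "__main__":
    print(f"underclocked performance is about {rel:.0%} of stock")
```

Under that assumption the underclocked part keeps a bit under 80% of stock performance at well under half the power; a higher real stock draw would lower the implied performance figure accordingly.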
They will likely drop the clock. I don't think the market cares at all about how much power a CPU takes. If a 1U box has competitive performance AND better performance/watt then it's attractive. If it has worse performance/watt then it's not.

AMD might well steal some of the dual socket market with a single socket, and maybe some of the quad socket market with dual sockets.

Considering that the current Ryzen at $500 is relatively competitive with the $1,000 Intel (basically a relabeled Xeon with 4 memory buses in the LGA2011 server socket), a quad module (32 core/64 thread) in a socket sounds pretty good. Even if it's more watts than the Intel.

The E7s go up to 160W each. If you can drive better than 1.7x the performance and stay within the maximum thermal output per physical volume, I see no reason not to use this.

One reason, perhaps, is if my binaries are compiled with Intel-specific optimizations and it's inconvenient to deploy separate AMD-optimized binaries.

I can see a use case for it, as long as it delivers on performance. No one minds high TDP, as long as it offers a performance advantage. Hell, some servers have 4-8 Titans in them, and no one is complaining about their TDP. If a 260W CPU TDP is justified by the performance, no one will care.
The bigger E7s scratch the 200 W mark pretty hard, and IBM already had POWER chips go beyond 200 W. However, cooling and power density are ... problematic. The same goes for accelerators. Supermicro will happily deliver you a 1U box with four Pascals and two Xeon sockets, but there is no datacenter in the world where you can stuff 42 of those in a cabinet. [Which doesn't mean that these don't make sense]

However, high end systems don't lend themselves well to mass-deployment (i.e. scale out).

Maybe for single-server or academia setups, but in datacenters TDP and power consumption absolutely do matter.
And that was my point as well.
I'm more than willing to admit I could be wrong. And maybe Intel will push the TDP envelope with servers as well if Naples proves a threat when things all shake out. It's just that if Intel hasn't put a ~250W server chip into production, I doubt AMD will. Then again, if it performs that much better, there's a calculus there that will need to be done. My prediction, based on no evidence, is that this 32-core chip will be clocked at 2.6 GHz and boost to 3.2 GHz. Shot in the dark, but given current TDPs that's where I think things might shake out.