| HN Mirror



	by onenine 5365 days ago
	This article is impressively bad. While the 8-module chip does share a few things (mainly a vector processing unit, that becomes two when doing the 128-bit SSE operations), it really can run 16 threads on 16 ALUs. But it will have SSE contention if more than eight 256-bit vector operations are scheduled at once (sadly Intel won't bring this instruction set to market for a bit). Bulldozer is pretty cool, but sadly the tech press decides to shit on the underdog in a market where multiple companies have successfully sued the monopolist for anti-competitive behavior. :(

2 comments

DrPizza 5365 days ago

> This article is impressively bad.

Cheers!

> While the 8-module chip does share a few things (mainly a vector processing unit, that becomes two when doing the 128-bit SSE operations)

A few things? No, it shares a lot of things. The entire floating point and SIMD unit. The entire front-end. The branch predictor is, I believe, a weird hybrid of shared and non-shared. The I-cache and the L2 cache are also both shared.

The front-end is particularly troublesome. The entire decoder can service either one thread or the other in a given cycle. If both threads need instructions, the best it can do is round-robin between them. That averages out to just two instructions per cycle per thread: less decode bandwidth than K10.
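To make the averaging concrete, here is a minimal sketch of the round-robin argument. It assumes a decoder that is 4 instructions wide (the per-thread peak mentioned elsewhere in this thread) and hands the whole decoder to one thread per cycle; the function name `decode_slots` and the cycle count are illustrative, not anything from AMD's documentation.

```python
def decode_slots(cycles, width=4, threads_active=2):
    """Count decode slots each thread receives when one shared decoder
    alternates between the active threads, one thread per cycle.

    `width` is an assumed decoder width, not a measured figure.
    """
    slots = [0] * threads_active
    for cycle in range(cycles):
        # The whole decoder goes to a single thread in any given cycle.
        slots[cycle % threads_active] += width
    return slots


cycles = 1000
shared = decode_slots(cycles, threads_active=2)
alone = decode_slots(cycles, threads_active=1)

# Average decode slots per cycle, per thread.
print("both threads busy:", [s / cycles for s in shared])
print("one thread idle:  ", [s / cycles for s in alone])
```

Under these assumptions each thread averages half the decoder's width when both are busy, and the full width when the sibling thread is idle.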

Likewise the integer units: there are fewer ALUs and AGUs per thread than in K10. Likewise the floating point unit. There's lots of sharing, and even the private, non-shared parts are resource-starved.

> But, they'll have SSE contention if they schedule more than 8 256-bit vector operations (sadly Intel won't bring this instruction set to market for a bit).

SSE contention will occur if a thread can issue more than two SSE operations per cycle, or one AVX operation per cycle.

> Bulldozer is pretty cool, but sadly the tech press decides to shit on the underdog in a market that multiple companies have successfully sued the monopolist for anti-competitive behavior. :(

I don't care about "the underdog" or which is "cool" or which multi-billion dollar corporation you might prefer. I care about which works better. It ain't Bulldozer.

dman 5365 days ago

Benchmarks at http://www.phoronix.com/scan.php?page=article&item=amd_f... appear to paint a picture of a much more balanced performance profile for Bulldozer chips. They do well in threaded applications and where code is recompiled for them.

I cringed a bit when I saw this on Ars Technica: the linkbait headline, the image of a burning bulldozer, the lack of any benchmarks that you ran yourself, and the fact that the data is presented in a lopsided fashion. Here are a few examples:

a) If you look at the actual prices for the Xeon system and the AMD system, you can see that the price of each system is dominated by the cost of the SSD. Of the ~1.5 million pre-discount price of the AMD system, nearly 1.2 million is for the SSD, while in the Xeon system 485k of the roughly 740k price is the SSD. Penalising AMD for that seems unfair. Also, it remains unclear what the SSD in the AMD system, at double the cost of the Xeon SSD, does for performance.

b) In the SPEC JBB2005 section, where the Bulldozer 6200 scores 1.25 million bops, the 6100 gets 0.981 million, and the Xeon gets 0.975 million, you explain away the high performance by saying that it exists only because of the higher number of cores.

c) For the SAP section: "the 6200 scores 31,720 SAPS, the 6100 scores 24,020, and the Xeon gets 28,480. The 6200 system, with 33 percent more processors than the 6100 system, gets 32 percent more performance." Here's a test that clearly contradicts your "Bulldozer is abysmal" narrative.

d) In the end you write: "AMD is boasting that Opteron 6200 is the "first and only" 16-core x86 processor on the market. Not only is this not really true (equating threads and cores is playing fast and loose with the truth), it just doesn't matter." Except that in the SPEC JBB2005 test you yourself said: "But these results are still cause for some concern. The 6200 part has 33 percent more cores than the 6100 part, as well as a minor clock speed advantage. Its performance in this CPU-stressing benchmark is only 27 percent greater than that of the 6100."

e) Next time, please run some benchmarks of your own.
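The price split in point a) can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures quoted above ("~1.5 million", "nearly 1.2 million", "roughly 740k"), so the results are rough; the variable and function names are illustrative.

```python
def ssd_share(ssd_price, system_price):
    """Fraction of the total system price accounted for by the SSD."""
    return ssd_price / system_price


# Approximate pre-discount figures as quoted in the comment above.
amd_total, amd_ssd = 1_500_000, 1_200_000
xeon_total, xeon_ssd = 740_000, 485_000

amd = ssd_share(amd_ssd, amd_total)
xeon = ssd_share(xeon_ssd, xeon_total)

print(f"AMD:  SSD is {amd:.0%} of the price, rest of system ~{amd_total - amd_ssd:,}")
print(f"Xeon: SSD is {xeon:.0%} of the price, rest of system ~{xeon_total - xeon_ssd:,}")
```

With these rounded inputs, the non-SSD remainder of the two systems lands in the same general range, which is the point being made about the price comparison.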

DrPizza 5365 days ago

The Phoronix benchmarks, like most others, suggest that the only area where Bulldozer appears at all competent is HPC. To describe this as niche is an understatement.

a) I agree it remains unclear how much difference the SSD makes. That's why I don't think it's a useful demonstration of Bulldozer's performance _even though AMD is citing it as such_.

b) Yes, I do. That 1/3 more cores gives 1/3 more performance in a test that scales almost perfectly means that the per-core performance has stood still. A 32 nm K10.5 chip with 1/3 more cores would perform just as well, cost less to build, use less power, and eliminate the performance regressions. So what is the point of Bulldozer?

c) No, it reinforces the "Bulldozer performs no better than a scaled-up K10.5 system would and hence is pointless" narrative.

d) @_@

e) No. I don't have a half million dollars of equipment just lying around so that I can run TPC-C (etc.) myself.
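The per-core argument in b) and c) can be worked through from the scores quoted earlier in the thread. This sketch assumes the core ratio is exactly 4/3 (the "33 percent more" figure) and ignores the minor clock speed difference; `per_core_change` is an illustrative helper, not anything from the article.

```python
def per_core_change(new_score, old_score, core_ratio):
    """Per-core throughput of the new part relative to the old one.

    A value of 1.0 means per-core performance is unchanged.
    """
    return (new_score / old_score) / core_ratio


CORE_RATIO = 4 / 3  # assumed from "33 percent more" cores/processors

# (Opteron 6200 score, Opteron 6100 score), as quoted in the thread.
benchmarks = {
    "SPEC JBB2005 (bops)": (1_250_000, 981_000),
    "SAP (SAPS)": (31_720, 24_020),
}

for name, (new, old) in benchmarks.items():
    speedup = new / old
    ratio = per_core_change(new, old, CORE_RATIO)
    print(f"{name}: {speedup - 1:.0%} more throughput, {ratio:.2f}x per core")
```

Under that assumption both ratios come out slightly below 1.0, which is what "per-core performance has stood still" amounts to numerically.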

dman 5365 days ago

My reading of the Phoronix article suggests that Bulldozer does fairly well on the following tests: a) ffmpeg encoding, b) parallel I/O, c) x264 encoding, d) compression, e) MP3 encoding, f) c-ray rendering, g) smallpt.

I will concede that I know virtually nothing of which workloads are representative of what percentage of the market.

DrPizza 5364 days ago

I don't think most server systems are doing much in the way of MP3 or H.264 encoding.

Rendering is more or less equivalent to HPC. Different markets, but similar problem sets (lots of computation, minimal communication or dependencies between threads).

None of those are particularly relevant to typical server workloads; servers are doing things like querying databases, spitting out Web pages, running Java VMs, running virtualization software, that kind of thing.

onenine 5364 days ago

Thanks for the reply, but your piece wasn't balanced and was well below the quality standards I used to hold for your site (always re-balance expectations!). You didn't talk about power consumption or anything interesting about the platform. We can get press releases from Intel.

DrPizza 5364 days ago

What is there that is "interesting" about the platform?

Power consumption was mentioned at a number of points in the article. It's just there's not a whole lot to say about it--it's not exactly a strength of the architecture.

apu 5365 days ago

Are you the author?

DrPizza 5365 days ago

Yes.

DarkShikari 5365 days ago

> While the 8-module chip does share a few things (mainly a vector processing unit

And more importantly, the decode and dispatch unit, which only runs every other clock for a given core -- thus limiting any given core to a theoretical maximum of a mere 2 IPC, and in practice a lot less than that, since the dispatch unit has limitations of its own, never mind branch mispredictions and such.

DrPizza 5365 days ago

My understanding is that it can give every cycle to a given thread just as long as the other thread doesn't need it to decode anything (if it's idle or whatever). i.e. it can give one thread 4 ops/cycle sustained, given the right workload. But for your purposes, that's probably not any improvement.

DarkShikari 5365 days ago

> i.e. it can give one thread 4 ops/cycle sustained, given the right workload.

Dispatch can only do 2 loads per cycle, and 1 store per cycle. Any more, and it stops on that instruction and dispatches nothing more for that cycle. On plenty of workloads, especially typical compiler output for C code, this is not going to nearly reach the 4 ops/cycle maximum, even on a single thread.
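A toy model shows how the load/store limits pull throughput below the 4 ops/cycle peak. It is only a sketch of the rule as described in this comment (in-order dispatch, at most 2 loads and 1 store per cycle, the cycle ending at the first op that would exceed a limit); it ignores every other hardware constraint, and `dispatch_cycles` and the sample instruction streams are made up for illustration.

```python
def dispatch_cycles(ops, width=4, max_loads=2, max_stores=1):
    """Cycles needed to dispatch `ops` in order.

    Each op is 'load', 'store' or 'alu'. A cycle ends when `width` ops
    have been dispatched, or early at the first op that would exceed the
    per-cycle load or store limit. Limits must be at least 1.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        loads = stores = issued = 0
        while i < len(ops) and issued < width:
            op = ops[i]
            if op == "load" and loads == max_loads:
                break  # third load: nothing more dispatches this cycle
            if op == "store" and stores == max_stores:
                break  # second store: nothing more dispatches this cycle
            loads += op == "load"
            stores += op == "store"
            issued += 1
            i += 1
        cycles += 1
    return cycles


streams = {
    "ALU only": ["alu"] * 8,
    "load heavy": ["load", "load", "load", "alu"] * 2,
    "store heavy": ["store"] * 4,
}
for name, ops in streams.items():
    n = dispatch_cycles(ops)
    print(f"{name}: {len(ops)} ops in {n} cycles ({len(ops) / n:.2f} ops/cycle)")
```

Only the pure-ALU stream sustains the full width in this model; any run of consecutive loads or stores cuts cycles short.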

onenine 5365 days ago

I'm not sure what it would look like in the video [d]ecoder world, but I don't think that would matter, since most of the time you'd want to use the 256-bit vector instructions (in practice this would hardly be a high priority until they're nearly ubiquitous...). For use cases where you are addressing large memory regions, this hardly seems like a big deal. There are times when you can schedule tons of calculations without leaving L1, but for some odd reason people are finding 500GB+ of RAM useful.

DarkShikari 5365 days ago

> since most of the time you'd want to use the 256-bit vector instructions

There are no 256-bit integer vector instructions on x86, and AVX is slower than SSE on Bulldozer.

onenine 5364 days ago

Sad but true... You can issue SIMD instructions on 4 doubles at once, though (and put whatever you want in those 16 registers).