by Dylan16807 702 days ago
> Like look at this 20 core processor! Oh wait, it's really an 8 core when it comes to performance.

The E cores are about half as fast as the P cores depending on use case, at about 30% of the size. If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed. (And if it can't use more than 8 cores then P versus E doesn't matter.) (Or if you meant 4P+16E then I don't think those exist.)
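
To make the arithmetic explicit, here is a rough sketch. It assumes an E core delivers exactly half the throughput of a P core (the comment only says "about half, depending on use case") and that the workload scales perfectly across cores. `p_core_equivalents` is a made-up helper name, not anything from a real library.

```python
def p_core_equivalents(p_cores: int, e_cores: int, e_relative_speed: float = 0.5) -> float:
    """Total throughput expressed in units of one P core.

    e_relative_speed is an assumption: the speed of one E core relative to one P core.
    """
    return p_cores + e_cores * e_relative_speed


# 8 P cores plus 12 E cores at half speed each: 8 + 12 * 0.5
print(p_core_equivalents(8, 12))  # 14.0
```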

> Hard to compare that to a 12 core 3D cached Ryzen with even higher clock...

Only half of those cores properly get the advantage of the 3D cache. And I doubt those cores have a higher clock.

AMD's doing quite well but I think you're exaggerating a good bit.

> If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed

Only if you use work stealing queues or (this is ridiculously unlikely) run multithreaded algorithms that are aware of the different performance and split the work unevenly to compensate.

Or if you use a single queue... which I would expect to be the default.

Blindly dividing work units across cores sounds like a terrible strategy for a general program that's sharing those cores with who-knows-what.
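
A toy simulation of the two strategies being argued about, as a sketch only. It assumes equal-sized tasks, P cores at speed 1.0 and E cores at speed 0.5, zero dispatch overhead, and no other load on the machine. The function names are invented for illustration.

```python
import heapq


def static_split_makespan(n_tasks, speeds):
    """Equal share of tasks per core, assigned up front; returns the finish time."""
    base, extra = divmod(n_tasks, len(speeds))
    # The first `extra` cores take one leftover task each.
    return max((base + (1 if i < extra else 0)) / s for i, s in enumerate(speeds))


def shared_queue_makespan(n_tasks, speeds):
    """Each core pulls the next task from one shared queue as soon as it is free."""
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, i = heapq.heappop(free_at)   # earliest-free core takes the task
        done = t + 1.0 / speeds[i]      # slower cores take longer per task
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


speeds = [1.0] * 8 + [0.5] * 12  # 8P + 12E, E assumed half as fast
print(static_split_makespan(1400, speeds))  # 140.0
print(shared_queue_makespan(1400, speeds))  # 100.0
```

Under these assumptions the equal split finishes when the E cores do (as if you had 10 P cores), while the shared queue lands at the 14-P-core figure. Real dispatch overhead would eat into the queue's advantage for very small tasks.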

It’s a common strategy for small tasks where the overhead of dispatching the task greatly exceeds the computation of it. It’s also a better way to maximize L1/L2 cache hit rates by improving memory locality.

E.g. you have 100M rows and you want to cluster them (naively) by a distance function. Each dist(arr[i], arr[j]) call is crazy fast; the problem is just that you have so many of them. Running everything on one core is faster than dispatching each call from one queue to multiple cores, but it's best to assign the work ahead of time to n cores and have them crunch the numbers.

It has always been a bad idea to dispatch so naively and dispatch to the same number of threads as you have cores. What if a couple cores are busy, and you spend almost twice as much time as you need waiting for the calculation to finish? I don't know how much software does that, and most of it can be easily fixed to dispatch half a million rows at a time and get better performance on all computers.
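
A minimal sketch of dispatching in fixed-size chunks rather than one slice per core. Everything here is illustrative: `chunk_ranges`, `process_chunk` and `run_chunked` are hypothetical names, and the per-row work is a stand-in. It uses a thread pool to stay short; in CPython, CPU-bound pure-Python work would need `ProcessPoolExecutor` (or native code that releases the GIL) to actually run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n_rows, chunk_size):
    """Split [0, n_rows) into consecutive (start, stop) chunks."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


def process_chunk(bounds):
    """Stand-in for the real per-row computation."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def run_chunked(n_rows, chunk_size, workers):
    # Many more chunks than workers: a worker that falls behind simply
    # ends up taking fewer chunks, instead of holding up the whole job.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunk_ranges(n_rows, chunk_size)))


print(chunk_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With 100M rows and half a million rows per chunk, that is 200 chunks, so the dispatch overhead is negligible while the tail is bounded by roughly one chunk's worth of work.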

Also on current CPUs it'll be affected by hyperthreading and launch 28 threads, which would probably work out pretty well overall.
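
Where the 28 comes from, as a quick sketch. It assumes two hardware threads per P core and one per E core; `logical_cpus` is a made-up helper. `os.cpu_count()` is the standard-library call that reports logical CPUs, which is what a program sizing its thread pool "by core count" typically ends up using.

```python
import os


def logical_cpus(p_cores, e_cores, threads_per_p_core=2):
    """Logical CPUs the OS would expose, assuming only P cores are hyperthreaded."""
    return p_cores * threads_per_p_core + e_cores


print(logical_cpus(8, 12))  # 28
print(os.cpu_count())       # logical CPU count on the machine running this
```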

> What if a couple cores are busy

If you don't pin them to cores, the OS is still free to assign threads to cores as it pleases. Assuming the scheduler is somewhat fair, threads will progress at roughly the same rate.

I would not assume it's sufficiently fair to make that a good algorithm.

Even a small bias could turn a 5 minute calculation into a 6 or 7 minute calculation as the stragglers finish up.
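
A back-of-the-envelope version of the straggler effect, assuming the work was split evenly up front and each thread progresses at a fixed fraction of the ideal rate. The rates below are invented for illustration, and `static_finish_minutes` is a hypothetical helper.

```python
def static_finish_minutes(ideal_minutes, relative_rates):
    """With an even static split, the job ends when the slowest thread ends."""
    return ideal_minutes / min(relative_rates)


# One thread gets only 75% of the CPU time the others get.
print(static_finish_minutes(5, [1.0, 1.0, 1.0, 0.75]))
```

With these made-up numbers the 5 minute job takes about 6.7 minutes, even though three of the four threads were done at the 5 minute mark.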

> run multithreaded algorithms that are aware of the different performance and split the work unevenly to compensate.

This is what the Intel Thread Director [0] solves.

For high-intensity workloads, it will prioritize assigning them to P-cores.

[0] https://www.intel.com/content/www/us/en/support/articles/000...

Then you no longer have 14 cores in this example, but only len(P) cores. Also most code written in the wild isn’t going to use an architecture-specific library for this.

The P cores being presented as two logical cores and E cores presented as a single logical core results in this kind of split already.

Yeah, the 20 core Intels are benchmarking about the same as the 12 core AMD X3Ds. But many people just see 20 > 12. Either one is more than fine for most people.

"Oh wait, it's really an 8 core when it comes to performance [cores]". So yes, it should not be counted as an 8 core altogether but, like you said, as about 14 cores, or 12 with the 3D cache.

"And I doubt those cores have a higher clock."

I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores. I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time). And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.

It’s kind of funny and reminiscent of the AMD Bulldozer days, when they had a ton of cores compared to the contemporary Intel chips, especially at low/mid price points, but the AMD chips were laughably underwhelming for single core performance, which was even more important then.

I can’t speak to the Intel chips because I’ve been out of the Intel game for a long time but my 5700X3D does seem to happily run all cores at max clock speed.

> I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores.

Oh, just higher clocked than the E cores. Yeah that's true, but if you're using that many cores at once you probably only care about total speed.

You said 12 core with higher clock versus 8, so I thought you were comparing to the performance cores.

> I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time).

The cores under the 3D cache have a notable clock penalty on existing CPUs.

> And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.

Right, but my point is it's misleading to call out higher core count and the advantages of 3D stacking. The 3D stacking mostly benefits the cores it's on top of, which is 6-8 of them on existing CPUs.

"The cores under the 3D cache have a notable clock penalty on existing CPUs."

Interesting. I can't find any info on that. It seems to make sense though, since the 7900X's TDP is 50 W higher than the 7900X3D's.

"Right, but my point is it's misleading to call out higher core count and the advantages of 3D stacking"

Yeah, that makes sense. I didn't realize there was a clock penalty on some of the cores with the 3D cache and that only some cores could use it.

It's due to the stacked cache being harder to cool and not supporting as high a voltage. So the 3D CCD clocks lower, but for some workloads it's still faster, mainly ones dealing with large buffers, like games. Most compute-heavy benchmarks fit in normal caches, and there the non-3D V-Cache variants take the win.