| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lunixbochs 204 days ago

your single core numbers seem way too low for peak throughput on one core, unless you stipulate that all cores are active and contending with each other for bandwidth

e.g. dual channel zen 1 showing 25GB/s on a single core https://stackoverflow.com/a/44948720

I wrote some microbenchmarks for single-threaded memcpy

    zen 2 (8-channel DDR4)
    naive c:
      17GB/s
    non-temporal avx:
      35GB/s

    Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
    naive c:
      9GB/s
    non-temporal avx:
      13.5GB/s

    apple silicon tests
    (warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

    m3
    naive c:
      17GB/s cold, 41GB/s warm
    non-temporal neon:
      78GB/s cold+warm

    m3 max 
    naive c:
      25GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    m4 pro
    naive c:
      13.8GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    (I'm not actually sure offhand why asi warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/linux didn't have any kind of cold/warm test difference. my guess would be that it's something about kernel page accounting and not related to the cpu)

I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on apple silicon

4 comments

nine_k 204 days ago

As much as I can understand a Zen 5 CPU core can run two AVX512 operations per clock (1024 bits) + 4 integer operations per clock (which use up FPU circuitry in the process), so additional 256 bits. At 4 GHz, this is 640 GB/s.

I suppose that in real life such ideal condition do not occur, but it shows how badly the CPU is limited by its memory bandwidth for streaming tasks. Its maximum memory-read bandwidth is 768 bits per clock. only 60% of its peak bit-crunching performance. DRAM bandwidth is even more limiting. And this is a single core of at least 12 (and at most 64).

link

kvemkon 204 days ago

> CPU is limited by its memory bandwidth for streaming tasks

That must be the reason, why EPYC 9175F exists. It is only 16-core CPU, but all 16 8-core CCDs are populated and only one core on each is active.

The next gen EPYC is rumored to have 16 instead of 12 memory channels (which were 8 only 4-5 years ago).

link

tanelpoder 204 days ago

This also leaves more power & thermal allowance for the IO Hub on the CPU chip and I guess the CPU is cheaper too.

If your workload is mostly about DMAing large chunks of data around between devices and you still want to examine the chunk/packet headers (but not touch all payload) on the CPU, this could be a good choice. You should have the full PCIe/DRAM bandwidth if all CCDs are active.

Edit: Worth noting that a DMA between PCIe and RAM still goes through the IO Hub (Uncore on Intel) inside the CPU.

link

vjerancrnjak 204 days ago

It is interesting that despite this we still have programming languages and libraries that cannot exploit pipelining to actually demonstrate IO is the bottleneck and not CPU

link

torginus 204 days ago

Thanks for the detailed writeup! This made me think of an interesting conundrum - with how much RAM modern computers come with (16GB is considered to be on the small side), having the CPU read the entire contents of RAM takes a nontrivial amount of time.

A single threaded Zen2 program could very well take 1 second to scan through your RAM, during which it's entirely trivial to read stuff from disk, so the modern practice of keeping a ton of stuff in RAM might be actually hurting performance.

Algorithms, such as garbage collection, which scan the entire heap, where the single-threaded version is probably slower than the naive zen 2 memcpy might run for more than a second even on a comparatively modest 16GB heap, which might not be even acceptable.

link

zozbot234 204 days ago

It's true that GC is a huge drain on memory throughput, even though it doesn't have to literally scan the entire heap (only potential references have to be scanned). The Golang folks are running into this issue, it's reached a point where the impact of GC traffic is itself a meaningful bottleneck on performance, and GC-free languages have more potential for most workloads.

link

jauntywundrkind 204 days ago

I agree, I don't think these numbers check out. IIRC people were a bit down on manycore Clearwater Forest in August for core complexes of 4x cores each sharing a 35GB/s link. (And also sharing 4MB of pretty alright 400GB/s L2 cache among them). This is a server chip so expectations are higher, but 6GB/s per core seems very unlikely.

https://chipsandcheese.com/p/intels-clearwater-forest-e-core...

link

eliasdejong 204 days ago

> A key feature of this code is that it skips CPU cache when copying

Are those numbers also measured while skipping the CPU cache?

link

lunixbochs 204 days ago

naive c is just a memcpy. non-temporal uses the streaming instructions.

link