| Increasingly, the performance limit for modern CPUs is the amount of data you can feed through a single core: basically memcpy() speed. On most x86 cores the limit is around 6 GB/s, and about 20 GB/s for Apple M chips. When you see advertised numbers like '200 GB/s', that is total memory bandwidth, i.e. all cores combined. For individual cores, the limit will still be around 6 GB/s. This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read. If, however, you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit. The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3 |
E.g. dual-channel Zen 1 showing 25 GB/s on a single core: https://stackoverflow.com/a/44948720
I wrote some microbenchmarks for single-threaded memcpy.
I really don't see how you can claim either a 6 GB/s single-core limit on x86 or a 20 GB/s limit on Apple silicon.
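
For anyone who wants to check the single-core number on their own machine, a rough sketch of this kind of measurement follows. It is not the microbenchmark mentioned above, just an illustration: the function name `copy_bandwidth_gb_per_s`, the 256 MiB buffer size, and the repeat count are all assumptions. It is written in Python for brevity; for buffers this large the time is dominated by the bulk memory copy behind the `memoryview` slice assignment rather than by the interpreter.

```python
import time


def copy_bandwidth_gb_per_s(size_bytes=256 * 1024 * 1024, repeats=5):
    """Return the best observed single-threaded copy rate in GB/s (1e9 bytes/s).

    Hypothetical helper for illustration; buffer size and repeat count are
    arbitrary choices, picked to be much larger than typical CPU caches.
    """
    # Fill the source with a non-zero byte so its pages are really backed by
    # memory instead of a shared zero page.
    src = bytearray(b"\xa5") * size_bytes
    dst = bytearray(size_bytes)
    src_view, dst_view = memoryview(src), memoryview(dst)

    # Warm-up copy: touches every destination page so page faults are not
    # included in the timed runs.
    dst_view[:] = src_view

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst_view[:] = src_view  # one contiguous bulk copy, single thread
        best = min(best, time.perf_counter() - start)

    assert dst == src  # sanity check: the copy really happened

    # Guard against a zero reading from a coarse timer on tiny buffers.
    return size_bytes / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    print(f"{copy_bandwidth_gb_per_s():.1f} GB/s copied")
```

Note that this counts bytes copied, so the actual memory traffic (read plus write) is roughly twice the printed figure. Results will also vary with buffer size, memory configuration, and whatever else is running on the machine.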