Hacker News
by iscoelho 643 days ago
If you are using C/C++ for any new app, there is a possibility you are writing code that has a performance requirement.

- mmap, io_uring, drivers, and other "zero-copy" implementations require you to think about byte order.

- filesystems, databases, network applications can be high throughput and will certainly benefit from being zero-copy (with benefits anywhere from +1% to +2000% in performance.)
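To make the mmap bullet concrete, here is a minimal sketch (POSIX only) of reading a file through a mapping instead of copying it into a user buffer with read(). The helper name `sum_file_bytes_mmap` is hypothetical and the error handling is deliberately thin; treat it as an illustration, not a benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: sum every byte of a file by mapping it read-only.
// The file's pages are accessed in place; nothing is copied into a
// separate user-space buffer. Returns -1 on any error.
long long sum_file_bytes_mmap(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    const uint8_t* bytes = static_cast<const uint8_t*>(map);
    long long sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += bytes[i];

    munmap(map, len);
    return sum;
}
```

Whether this beats read() depends on file size and access pattern, so measure before committing to it.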

This is absolutely not "premature optimization." If you're a C/C++ engineer, you should know off the top of your head how many cycles syscalls & memcpys cost. (Spoiler: They're slow.) You should evaluate your performance requirements and decide if you need to eliminate that overhead. For certain applications, if you do not meet the performance requirements, you cannot ship.

6 comments

Once upon a time I became the de facto admin for a VxWorks box because my code was to be the bottleneck on a task with a min throughput defined in the requirements and we weren't hitting the numbers. I ended up having to KVM into it and run benchmarks in vivo, which meant understanding a command line I'd never seen before.

People were understandably concerned that we had fucked up in the feasibility phase of the project. Lots of people get themselves in trouble this way, and this was a 9 figure piece of hardware sitting idle while our app picked its nose crunching data, if we didn't finish our work on time during maintenance windows.

But I was on the longest hot streak of accurate perf estimates in my career, and this one was not going to be my Icarus moment. It ended up being tweaks needed from the compiler writer and from Wind River (a DMA problem). I had to spend a lot of social capital on all of this, especially the Wind River conference call. After months and months of begging for that call, it took them ten minutes to come around to my suggestion for a fix, which they shipped us in a week.

100% on the business implications. Although a lot of engineers never have to touch it, DMA (& zero-copy) implementations are foundational to the performance of modern day computers that we sometimes take for granted.
The hard drive was running so slow I exclaimed “it’s almost like this drive is running in PATA mode.”

It was. The motherboard and CPU were newer than the VxWorks version, and it was running in compatibility mode. We treated it like the previous hardware revision it was backward compatible with and got 30% more throughput, like magic. Exactly as predicted.

A memcpy should not be slow. It should be nearly as fast as generic memory copying can be. Most of the time you shouldn't even hit the actual function, but instead a bit of code generated by the compiler that does exactly the copy you need.
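A small sketch of what "a bit of code generated by the compiler" means in practice. The helper names `load_u32` and `store_u32` are hypothetical. With optimization enabled, GCC and Clang on x86_64 typically turn a fixed-size memcpy like this into a single load or store rather than a library call, though that depends on the compiler and flags.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a host-order 32-bit value from any address,
// aligned or not. The copy size is a compile-time constant, so the
// optimizer is free to replace the memcpy call with one load instruction.
inline uint32_t load_u32(const void* src) {
    uint32_t v;
    std::memcpy(&v, src, sizeof v);
    return v;
}

// Hypothetical helper: the matching store.
inline void store_u32(void* dst, uint32_t v) {
    std::memcpy(dst, &v, sizeof v);
}
```

This is also the portable way to do unaligned or type-punned access, which is why it shows up so often in parsing code.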
memcpy is extremely slow. On any high-load Linux webserver, you can type "perf top" and see ~20% of the CPU usage consumed by memcpy/syscalls/virtual memory.

This article is a good demonstration of the performance improvements via mmap zero-copy: https://medium.com/@kaixin667689/zero-copy-principle-and-imp...

Netflix also relies on zero-copy via kTLS & zero-copy TLS to serve 400Gbps: https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-...

However, the performance gap can get even larger! (The kernel is historically not great at this.) For NVME & packet processors, you can see an increase of 10,000%+ in performance easily via a zero-copy implementation. See: https://www.dpdk.org https://spdk.io

memcpy gets weird with pointer aliasing as well. There's a slower path if the pointers can end up overlapping, and you either have to prove it programmatically like Java does, do the defensive copy, or YOLO it and hope.
memcpy is only defined for non-overlapping memory regions (otherwise you should use memmove), but many platforms use memmove for memcpy anyway to avoid breaking user programs in unpredictable ways. Apparently this has also led to some arguments and glibc version incompatibilities (https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html).
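For anyone who hasn't hit this, a minimal sketch of the overlapping case. The helper `drop_prefix` is hypothetical. Source and destination share the same buffer, so memmove is the call the standard requires; memcpy here would be undefined behavior even if it happens to work on a given platform.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: discard the first n bytes of a buffer by sliding
// the remaining len - n bytes down to the front, in place.
// The two regions overlap whenever n < len - n, so memmove is required.
void drop_prefix(char* buf, std::size_t len, std::size_t n) {
    if (n >= len) return;  // nothing left to keep
    std::memmove(buf, buf + n, len - n);
}
```

Bytes past the new logical end are left as they were; the caller is expected to track the shortened length.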
I don’t know why I said “path”, I meant instruction.
Any implementation of an algorithm is slow when your baseline is not performing the computation at all.
The fastest line of code is no line at all.‡

[‡]: Unless it's some weird architectural fluke with pipelining.

Haha, it's zero-copy! I never said it was "faster-copy" (-:
Apples and oranges. They're very different things, even if there's some overlap in use cases.
Yeah, I've always been blown away by how fast memcpy is. I'm guessing the OP is from a different world of engineering than I am.
The compiler can optimize this. See https://gcc.godbolt.org/z/hxW7hhrd7

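  #include <cstdint>
  // Assemble a little-endian 32-bit value one byte at a time.
  uint32_t read_le_uint32(const uint8_t* p)
  {
      // Cast before shifting: p[3] << 24 on a promoted int can shift into the sign bit.
      return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
  }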
ends up as

  read_le_uint32(unsigned char const*):
          mov     eax, dword ptr [rdi]
          ret
This works with Clang and gcc on x86_64 (but not with MSVC).
The purpose of zero-copy can be to avoid deserialization altogether. All you do to deserialize is:

  uint8_t *buf = ...;
  struct example_payload *payload = (struct example_payload *) buf;

That's why when you access the variables you need to byte order swap. This is absolutely not portable, I agree. I also agree that it is error-prone. However, it is the reality of a lot of performance critical software.
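A hedged sketch of that overlay pattern with the byte swap done on access. Only the name `example_payload` comes from the comment above; the field layout (a 4-byte length and a 2-byte type, both big-endian) and the accessor names are assumptions for illustration. Declaring the fields as byte arrays keeps the struct at alignment 1 with no padding, which sidesteps the alignment half of the portability problem. The pointer cast itself still leans on behavior that compilers support in practice rather than on anything the language strictly promises.

```cpp
#include <cstdint>

// Assumed wire format: big-endian fields, stored as raw bytes so the
// struct has no padding and can sit at any address in the buffer.
struct example_payload {
    uint8_t length_be[4];
    uint8_t type_be[2];
};
static_assert(sizeof(example_payload) == 6, "unexpected padding");

// "Deserialize" by reinterpreting the buffer in place: no copy is made.
inline const example_payload* view_payload(const uint8_t* buf) {
    return reinterpret_cast<const example_payload*>(buf);
}

// Byte order is fixed up only when a field is actually read.
inline uint32_t payload_length(const example_payload* p) {
    return (uint32_t(p->length_be[0]) << 24) | (uint32_t(p->length_be[1]) << 16) |
           (uint32_t(p->length_be[2]) << 8)  |  uint32_t(p->length_be[3]);
}

inline uint16_t payload_type(const example_payload* p) {
    return static_cast<uint16_t>((p->type_be[0] << 8) | p->type_be[1]);
}
```

Real code in this style more often uses packed structs with integer fields plus ntohl/be32toh-style swaps, which is faster to write and easier to get wrong.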

Yeah, I’ve occasionally had to manually special case big/little endian code, but most of the time you can write the generic code and the optimizer will take care of it. Unless you’re doing something very complicated it’s quite a trivial optimization to perform.
My uses of mmap have only ever been for memoization, where I didn't care about byte order and instead just assumed the files wouldn't be portable between any two computers.

If you are going zero copy, you either need to give up on any kind of portability, or delve deep into compiler flags to standardize struct layout.
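One alternative to compiler flags, sketched under assumptions: use fixed-width types, order the fields so no padding is needed, and let the build fail if the layout ever drifts. The struct `record_header` and its fields are hypothetical. This pins down sizes and offsets only; it does nothing about byte order, so files are still tied to the endianness of the machine that wrote them.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical on-disk header. Fields are ordered largest-alignment-safe
// so that common ABIs insert no padding between them.
struct record_header {
    uint32_t magic;
    uint16_t version;
    uint16_t flags;
    uint64_t payload_bytes;
};

// Compile-time checks: any ABI that lays this out differently fails the
// build instead of silently reading garbage from a mapped file.
static_assert(std::is_standard_layout<record_header>::value, "offsetof needs standard layout");
static_assert(std::is_trivially_copyable<record_header>::value, "must be safe to copy as raw bytes");
static_assert(sizeof(record_header) == 16, "unexpected padding");
static_assert(offsetof(record_header, magic) == 0, "layout drift");
static_assert(offsetof(record_header, version) == 4, "layout drift");
static_assert(offsetof(record_header, flags) == 6, "layout drift");
static_assert(offsetof(record_header, payload_bytes) == 8, "layout drift");
```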

maybe i'm missing something because I don't code network drivers but wouldn't it be something like...

if it's little endian (on the wire), the process would be like:

    (value[0] | (value[1] << 8) | (value[2] << 16) | (value[3] << 24))
and in big endian (again, on the wire, architecture endianness irrelevant) it would be the same thing with the indices reversed, where "value" is the 4 bytes read in off the wire?
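Spelled out as a sketch, with hypothetical function names, that is roughly the following. Both functions give the same result on any host, because they only describe the order of bytes on the wire.

```cpp
#include <cstdint>

// Little-endian on the wire: the first byte is the least significant.
uint32_t read_le_u32(const uint8_t* value) {
    return  uint32_t(value[0])        | (uint32_t(value[1]) << 8) |
           (uint32_t(value[2]) << 16) | (uint32_t(value[3]) << 24);
}

// Big-endian on the wire: same expression with the indices reversed.
uint32_t read_be_u32(const uint8_t* value) {
    return  uint32_t(value[3])        | (uint32_t(value[2]) << 8) |
           (uint32_t(value[1]) << 16) | (uint32_t(value[0]) << 24);
}
```

The casts to uint32_t before shifting keep the top byte from being shifted into the sign bit of a promoted int.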
The performance would be absolutely horrendous if network drivers were programmed this way. DMA (Direct Memory Access) is all about avoiding deserialization and copies of the data.
> memcpy slow

Uh...

Compared to doing nothing, yes it's "slow."