| HN Mirror

by iscoelho 643 days ago

If you are using C/C++ for any new app, there is a possibility you are writing code that has a performance requirement.

- mmap/io_uring/drivers and other "zero-copy" implementations require you to think about byte order.

- filesystems, databases, network applications can be high throughput and will certainly benefit from being zero-copy (with benefits anywhere from +1% to +2000% in performance.)

This is absolutely not "premature optimization." If you're a C/C++ engineer, you should know off the top of your head how many cycles syscalls & memcpys cost. (Spoiler: They're slow.) You should evaluate your performance requirements and decide if you need to eliminate that overhead. For certain applications, if you do not meet the performance requirements, you cannot ship.

6 comments

hinkley 643 days ago

Once upon a time I became the de facto admin for a VxWorks box because my code was going to be the bottleneck on a task with a minimum throughput defined in the requirements, and we weren't hitting the numbers. I ended up having to KVM into it and run benchmarks in vivo, which meant learning a command line I'd never seen before.

People were understandably concerned that we had fucked up in the feasibility phase of the project. Lots of people get themselves in trouble this way, and if we didn't finish our work on time during maintenance windows, a 9-figure piece of hardware would be sitting idle while our app picked its nose crunching data.

But I was on my longest hot streak of accurate perf estimates in my career and this one was not going to be my Icarus moment. It ended up being tweaks needed from the compiler writer and from Wind River (DMA problem). I had to spend a lot of social capital on all of this, especially the Wind River conference call. After months and months of begging for that call, it took them ten minutes to come around to my suggestion for a fix, which they shipped us in a week.

iscoelho 643 days ago

100% on the business implications. Although a lot of engineers never have to touch it, DMA (& zero-copy) implementations are foundational to the performance of modern day computers that we sometimes take for granted.

hinkley 642 days ago

The hard drive was running so slow I exclaimed “it’s almost like this drive is running in PATA mode.”

It was. The motherboard and CPU were newer than the VxWorks version, and it was running in compatibility mode. We treated it like the previous hardware revision it was backward compatible with and got 30% more throughput, like magic. Exactly as predicted.

AlotOfReading 643 days ago

A memcpy should not be slow. It should be nearly as fast as generic memory copying can be. Most of the time you shouldn't even hit the actual function, but instead a bit of code generated by the compiler that does exactly the copy you need.
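A minimal sketch of the point about small fixed-size copies, assuming a mainstream optimizing compiler such as GCC or Clang with optimization enabled (the function name `load_u32` is hypothetical). With a constant size, the `memcpy` call is typically replaced by a single load instruction rather than a call into the library:

```cpp
#include <cstdint>
#include <cstring>

// Load a 32-bit value in host byte order from a possibly unaligned address.
// The size is a compile-time constant, so optimizing compilers usually
// lower this memcpy to one plain load instead of calling the library.
uint32_t load_u32(const unsigned char* p)
{
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

This is also the well-defined way to read a value from an unaligned byte buffer, so the fast path and the correct path are usually the same code.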

iscoelho 643 days ago

memcpy is extremely slow. On any high-load Linux webserver, you can run "perf top" and see ~20% of the CPU usage consumed by memcpy/syscalls/virtual memory.

This article is a good demonstration of the performance improvements via mmap zero-copy: https://medium.com/@kaixin667689/zero-copy-principle-and-imp...

Netflix also relies on zero-copy via kTLS & zero-copy TLS to serve 400Gbps: https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-...

However, the performance gap can get even larger! (The kernel is historically not great at this.) For NVME & packet processors, you can see an increase of 10,000%+ in performance easily via a zero-copy implementation. See: https://www.dpdk.org https://spdk.io
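As a rough illustration of the mmap variant of zero-copy mentioned above, here is a sketch that scans a file through a read-only mapping instead of `read()`-ing it into a user-space buffer. It assumes a POSIX system; the function name `count_byte_mmap` and its error convention are hypothetical, and it says nothing about how DPDK or SPDK work internally.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count occurrences of `needle` in a file by mapping it read-only.
// The bytes are accessed through the mapping; there is no read() into
// a separate user-space buffer. Returns -1 on error.
long count_byte_mmap(const char* path, unsigned char needle)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    const unsigned char* bytes = static_cast<const unsigned char*>(map);
    long count = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (bytes[i] == needle) ++count;

    munmap(map, len);
    return count;
}
```

Whether this beats a plain `read()` loop depends on file size and access pattern, since page faults have their own cost; measure before committing to it.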

hinkley 643 days ago

memcpy gets weird with pointer aliasing as well. There's a slower path if the pointers can end up overlapping, and you either have to prove it programmatically like Java does, do the defensive copy, or YOLO it and hope.

nyanpasu64 642 days ago

memcpy is only defined for non-overlapping memory regions (otherwise you should use memmove), but many platforms use memmove for memcpy anyway to avoid breaking user programs in unpredictable ways. Apparently this has also led to some arguments and glibc version incompatibilities (https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html).
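A small sketch of the distinction, with a hypothetical helper name. The source and destination ranges overlap here, so `memmove` is the only one of the two that is defined for this call:

```cpp
#include <cstddef>
#include <cstring>

// Shift the first `len` bytes of `buf` right by one position.
// Source [buf, buf+len) and destination [buf+1, buf+1+len) overlap,
// so memmove is required; memcpy here would be undefined behavior.
// The caller must ensure buf has room for len + 1 bytes.
void shift_right_by_one(char* buf, std::size_t len)
{
    std::memmove(buf + 1, buf, len);
}
```

For example, shifting the six bytes of "abcdef" in an eight-byte buffer leaves "aabcdef" in the first seven bytes.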

hinkley 642 days ago

I don’t know why I said “path”, I meant instruction.

formerly_proven 643 days ago

Any implementation of an algorithm is slow when your baseline is not performing the computation at all.

hinkley 643 days ago

The fastest line of code is no line at all.‡

[‡]: Unless it's some weird architectural fluke with pipelining.

iscoelho 643 days ago

Haha, it's zero-copy! I never said it was "faster-copy" (-:

AlotOfReading 642 days ago

Apples and oranges. They're very different things, even if there's some overlap in use cases.

nightowl_games 643 days ago

Yeah, I've always been blown away by how fast memcpy is. I'm guessing the OP is from a different world of engineering than I am.

neonz80 643 days ago

The compiler can optimize this. See https://gcc.godbolt.org/z/hxW7hhrd7

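  #include <cstdint>
  uint32_t read_le_uint32(const uint8_t* p)
  {
      // Cast before shifting so p[3] << 24 cannot overflow a signed int.
      return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
  }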

ends up as

  read_le_uint32(unsigned char const*):
          mov     eax, dword ptr [rdi]
          ret

This works with Clang and gcc on x86_64 (but not with MSVC).

iscoelho 643 days ago

The purpose of zero-copy can be to avoid deserialization at all. All you do to deserialize is:

    uint8_t *buf = ...;
    struct example_payload *payload = (struct example_payload *) buf;

That's why you need to swap byte order when you access the fields. This is absolutely not portable, I agree. I also agree that it is error-prone. However, it is the reality of a lot of performance-critical software.
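A hedged sketch of the "convert on access" idea. It deliberately swaps the raw pointer cast for a small view type that holds only a pointer, which sidesteps alignment and aliasing concerns while still copying nothing up front. The wire layout, the field names, and the name `example_payload_view` are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical wire layout, all fields big-endian:
//   bytes 0..3  id      (u32)
//   bytes 4..5  length  (u16)
// The view stores only a pointer into the receive buffer; nothing is
// copied or deserialized until a field is actually read.
struct example_payload_view {
    const uint8_t* buf;

    // Assemble the big-endian u32 into host order at access time.
    uint32_t id() const
    {
        return (uint32_t(buf[0]) << 24) | (uint32_t(buf[1]) << 16) |
               (uint32_t(buf[2]) << 8)  |  uint32_t(buf[3]);
    }

    // Assemble the big-endian u16 into host order at access time.
    uint16_t length() const
    {
        return uint16_t((buf[4] << 8) | buf[5]);
    }
};
```

Fields that are never read cost nothing, and on compilers that recognize the pattern each accessor tends to reduce to a load plus a byte swap.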

plorkyeran 643 days ago

Yeah, I’ve occasionally had to manually special-case big/little endian code, but most of the time you can write the generic code and the optimizer will take care of it. Unless you’re doing something very complicated, it’s quite a trivial optimization to perform.

rocqua 643 days ago

My uses of mmap have only ever been memoization, where I didn't care about byte order and instead just assumed the files wouldn't be portable between any two computers.

If you are going zero copy, you either need to give up on any kind of portability, or delve deep into compiler flags to standardize struct layout.
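One lightweight alternative to compiler flags, sketched here with a hypothetical record type: use fixed-width fields, order them so no padding is needed, and let `static_assert` reject any build where the layout differs. This pins size and offsets only; byte order still has to be handled separately.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical on-disk record. Fields are ordered largest-first so that
// no padding is needed on common ABIs; the static_asserts below turn any
// layout surprise into a compile error instead of silent corruption.
struct record_header {
    uint64_t offset;
    uint32_t length;
    uint16_t kind;
    uint16_t flags;
};

static_assert(sizeof(record_header) == 16, "unexpected padding");
static_assert(offsetof(record_header, offset) == 0, "layout changed");
static_assert(offsetof(record_header, length) == 8, "layout changed");
static_assert(offsetof(record_header, kind) == 12, "layout changed");
static_assert(offsetof(record_header, flags) == 14, "layout changed");
```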

pmarreck 643 days ago

Maybe I'm missing something because I don't code network drivers, but wouldn't it be something like...

if it's little endian (on the wire), the process would be like:

    (value[0] | (value[1] << 8) | (value[2] << 16) | (value[3] << 24))

and in big endian (again, on the wire, architecture endianness irrelevant) it would be the same thing with the indices reversed, where "value" is the 4 bytes read in off the wire?
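Spelled out as a sketch (function names are hypothetical), the two decoders differ only in which index supplies which byte. Each byte is widened to `uint32_t` before shifting so the shift by 24 cannot overflow a signed `int`:

```cpp
#include <cstdint>

// Decode a 32-bit value whose least significant byte comes first on the wire.
uint32_t read_le_u32(const uint8_t* value)
{
    return  uint32_t(value[0])        | (uint32_t(value[1]) << 8) |
           (uint32_t(value[2]) << 16) | (uint32_t(value[3]) << 24);
}

// Same expression with the indices reversed: most significant byte first.
uint32_t read_be_u32(const uint8_t* value)
{
    return  uint32_t(value[3])        | (uint32_t(value[2]) << 8) |
           (uint32_t(value[1]) << 16) | (uint32_t(value[0]) << 24);
}
```

Both functions give the same result regardless of the host's own endianness, because they only ever talk about byte positions in the buffer.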

iscoelho 643 days ago

The performance would be absolutely horrendous if network drivers were programmed this way. DMA (Direct Memory Access) is all about avoiding deserialization and copies of the data.

paulddraper 643 days ago

> memcpy slow

Uh...

Compared to doing nothing, yes it's "slow."