by dfbrown 2956 days ago
Worth noting that ucontext is quite slow (at least on linux): https://www.boost.org/doc/libs/1_67_0/libs/context/doc/html/...

It's fast enough for many applications.

I think ucontext is an excellent starting point for a general implementation. You just abstract it with a thin veneer and adopt faster implementations as needed where applicable.
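
A minimal sketch of what such a veneer could look like, assuming a POSIX system that still ships <ucontext.h> (glibc does). The co_* names and the co_t layout are made up for illustration and come from no library; the point is only that callers see co_init/co_resume/co_yield, so the three swapcontext/makecontext calls could later be replaced by a faster switch without touching them.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <ucontext.h>

/* Hypothetical veneer: every co_* name here is illustrative only. */
typedef struct co {
    ucontext_t self;          /* saved state of the coroutine */
    ucontext_t caller;        /* saved state of whoever resumed it last */
    void (*fn)(struct co *);  /* coroutine body */
    void *stack;              /* heap-allocated stack */
    int done;                 /* set once fn has returned */
    int value;                /* last value passed to co_yield */
} co_t;

/* makecontext only passes int arguments portably, so the coroutine being
   started is handed over through a static (not thread-safe). */
static co_t *co_current;

static void co_trampoline(void) {
    co_t *co = co_current;
    co->fn(co);
    co->done = 1;
    /* Returning here activates uc_link, i.e. co->caller. */
}

int co_init(co_t *co, void (*fn)(co_t *), size_t stack_size) {
    co->stack = malloc(stack_size);
    if (co->stack == NULL) return -1;
    if (getcontext(&co->self) != 0) { free(co->stack); return -1; }
    co->self.uc_stack.ss_sp = co->stack;
    co->self.uc_stack.ss_size = stack_size;
    co->self.uc_link = &co->caller;
    co->fn = fn;
    co->done = 0;
    co->value = 0;
    makecontext(&co->self, co_trampoline, 0);
    return 0;
}

/* Run the coroutine until it yields or finishes. */
void co_resume(co_t *co) {
    assert(!co->done);
    co_current = co;
    swapcontext(&co->caller, &co->self);
}

/* Called from inside the coroutine: hand a value back to the resumer. */
void co_yield(co_t *co, int value) {
    co->value = value;
    swapcontext(&co->self, &co->caller);
}

static void count_to_three(co_t *co) {
    for (int i = 1; i <= 3; i++) co_yield(co, i);
}

/* Drives the coroutine to completion and sums what it yielded. */
int co_demo_sum(void) {
    co_t co;
    int sum = 0;
    if (co_init(&co, count_to_three, 64 * 1024) != 0) return -1;
    for (;;) {
        co_resume(&co);
        if (co.done) break;
        sum += co.value;
    }
    free(co.stack);
    return sum;
}
```

Calling co_demo_sum() from main resumes the coroutine four times (three yields plus the final return) and adds up the yielded values.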

It's slower than Boost Context for sure, but still around 10x faster than (ab)using Pthreads on Linux. Until someone releases a faster standalone C library it's still the fastest portable solution for projects like Cixl that can't afford dragging C++ around.
There are already many fast, portable, standalone C libraries for this. For instance, see table 1 of https://www.gnu.org/software/pth/rse-pmt.ps. The assembly code for a context switch is pretty minimal (on most platforms, it's just setjmp and longjmp) if you don't try to manage the signal state. I would be surprised if there were significant variance in performance (other than whether it chooses to switch the signal mask). Additionally, many language runtimes directly support it without any extra effort on your part. So if you choose to use one of those languages instead, you get a fast portable solution without needing to do any extra work to pick a support library (for example, D-Lang LDC, PyPy, Go, Julia).
I implemented coroutines for C with assembly [1] (x86 32 and 64 bit). I took advantage of the calling convention to cut down on the amount of state to save (four registers on 32-bit x86 and six on 64-bit x86). Mixing this with signals is probably unwise [2]. So far I've tested the code on Linux and Mac OS X and it works (although I might not use it for C++ either).

[1] https://github.com/spc476/C-Coroutines

[2] In my not-so-humble opinion, using signals at all is not wise.

The shortest context switch sequence I could come up with on x86-64 is three instructions:

  xchg  %rsp, %rdx
  leaq  1f(%rip), %rax
  jmp   *%rsi
1:

It expects the target stack ptr/ip pair to be in rdx/rsi and saves the current stack ptr and ip in rdx/rax. It does not save any registers itself; it uses GCC asm clobbers to instruct the compiler to save any other registers.

Code at [1]. The comments about hacks and UB are there because I'm trying to transparently propagate exceptions across coroutines; otherwise the stack switching is fairly robust (although GCC-specific).

[1] https://github.com/gpderetta/delimited/blob/master/delimited...

Signals, done correctly, are hard. I agree with your NSHO for the most part. Neat code!
Not faster than dealing directly with ucontext from what I've seen; many wrap it directly and the rest tend to emulate using signals, setjmp and prayers.

I would love to be wrong though...

ucontext modifies the signal mask, requiring a syscall. That’s very expensive (as shown by the boost benchmark above), and usually unnecessary. “Emulate” is a rather negative-sounding way to describe running effectively the same code as ucontext does (which also happens to typically be the same as sigsetjmp); it’s not like ucontext has some privileged permissions. It’s “just” a context switch.
I was under the impression that Boost Context is more than that; at least that's what the amount of assembler code tells me; but I'll be the first to admit I don't have much patience for deciphering modern C++. I get that it's possible to invoke the same functionality as ucontext minus signal masks manually, but I'm not convinced it would save enough cycles to pay for the added complexity.
Saving the signal mask can easily cost hundreds to thousands of cycles, one or two orders of magnitude more than the rest of the context switch.
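
For anyone who wants to measure this on their own machine, a rough timing harness along these lines could work. It is only a sketch: the bench_* names are invented, it uses clock_gettime(CLOCK_MONOTONIC), and it reports an average over a ping-pong loop, so the result includes loop overhead and will vary by platform and libc.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <ucontext.h>

/* Hypothetical names, for illustration only. */
static ucontext_t bench_main, bench_co;
static char bench_stack[64 * 1024];
static long bench_entries;  /* how many times the coroutine got control */

static void bench_body(void) {
    for (;;) {
        bench_entries++;
        swapcontext(&bench_co, &bench_main);  /* bounce straight back */
    }
}

/* Ping-pongs `rounds` times between the caller and a coroutine. Returns how
   often the coroutine ran, or -1 on error; if ns_per_switch is not NULL it
   receives the average time per swapcontext call in nanoseconds. */
long bench_swapcontext(long rounds, double *ns_per_switch) {
    struct timespec t0, t1;
    assert(rounds > 0);

    bench_entries = 0;
    if (getcontext(&bench_co) != 0) return -1;
    bench_co.uc_stack.ss_sp = bench_stack;
    bench_co.uc_stack.ss_size = sizeof bench_stack;
    bench_co.uc_link = &bench_main;
    makecontext(&bench_co, bench_body, 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < rounds; i++)
        swapcontext(&bench_main, &bench_co);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (ns_per_switch != NULL) {
        double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        *ns_per_switch = ns / (2.0 * (double)rounds);  /* two switches per round */
    }
    return bench_entries;
}
```

Calling bench_swapcontext(1000000, &ns) from main and printing ns gives a per-switch figure to compare against a switch that leaves the signal mask alone.
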
For the record: I just benchmarked GNU Pth with a NULL sigmask against ucontext for the example in the post and it's slightly slower (3.4 s vs. 3.2 s).