by sannysanoff 1858 days ago
I used a similar thing, built on top of the cppcoro library (a wonderful thing). My application is heavily threaded, with hundreds of thousands of short-lived micro-tasks: it's an interpreter of highly parallel expressions, and the values are large matrices containing expressions, so it's highly parallelizable.

I moved to C++ coroutines from the composable futures (CF) library, which had a few thread pool implementations, if memory serves (and before CF everything was written as callback hell). Out of the box, CF had extra CPU overhead because its internal implementation was not efficient enough for my use: too many templates and too much copying when switching tasks. Also, spawned tasks had to reference shared pointers in user space (my app code), and the frequent, unneeded copying of shared pointers added overhead.

Later I completely rewrote the CF implementation, so before coroutines my app used the CF API extensively but with the internals reimplemented. Even so, the shared pointer copying was still far from perfect.

In addition, I had an abstraction layer (async/await/spawn/wait_all and the like) on top of the CF API, so transforming the application code was not painful. I had to rewrite the synchronization primitives to use the mutexes that come with cppcoro, and change my own internal scheduler to use some other new primitives.
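To illustrate why such a layer makes a backend swap cheap, here is a minimal sketch of a spawn/await/wait_all facade. This is not the actual code from my app: the `rt` namespace, the `Task` alias, and `sum_of_squares` are made-up names, and `std::async` is only a stand-in for CF, cppcoro, or whatever runs underneath.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

namespace rt {  // hypothetical facade; the application only ever sees these names

// Assumption: std::future stands in for the backend's task handle.
template <typename T>
using Task = std::future<T>;

// Launch a unit of work on the backend (here: std::async on its own thread).
template <typename F>
auto spawn(F&& fn) -> Task<decltype(fn())> {
    return std::async(std::launch::async, std::forward<F>(fn));
}

// Wait for a single task and return its result.
template <typename T>
T await(Task<T>& task) {
    return task.get();
}

// Wait for every task, collecting the results in submission order.
template <typename T>
std::vector<T> wait_all(std::vector<Task<T>>& tasks) {
    std::vector<T> results;
    results.reserve(tasks.size());
    for (auto& task : tasks) {
        assert(task.valid());  // each task may be awaited only once
        results.push_back(task.get());
    }
    return results;
}

}  // namespace rt

// Application code: nothing here names the backend directly.
long sum_of_squares(int n) {
    std::vector<rt::Task<long>> tasks;
    for (int i = 1; i <= n; ++i) {
        tasks.push_back(rt::spawn([i] { return static_cast<long>(i) * i; }));
    }
    long total = 0;
    for (long value : rt::wait_all(tasks)) total += value;
    return total;
}
```

Swapping the backend then means reimplementing the few functions in `rt`, while callers such as `sum_of_squares` stay as they are.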

I was afraid that storing local variables in coroutine frames (instead of stack frames) would affect performance, but for some reason it did not.

I also expected compilation time to increase, but for some reason it mostly did not. Probably template expansion takes most of the time, so the coroutine code transformation pales in comparison.

Since then I have stopped using C++ coroutines.

I dropped them for the following reasons:

1) Unable to debug. The debugger does not have access to local variables, or I could not work out how to enable it. Reference time point: around 9 months ago. Also, stack traces: they are missing, and of course there is no help from tools. You have a core file, go figure.

2) g++ support was missing in the early days when I adopted coroutines (clang 9 had just been released), but even the clang 10 compiler produced wrong code when using suspended lambda functions. I use lambdas a lot, and since suspended functions are contagious across the code base, lambdas inevitably become suspended too. So it was just an occasional SIGSEGV or wrong values. There was a workaround: move 100% of the lambda body to a separate function and then call it from the lambda, but that destroys all the lambda beauty.

I moved to the Chinese libgo (it can be found on GitHub). I don't use the syscall interceptors it offers; I just use the cooperative scheduler it provides, along with its synchronization primitives. It's stackful cooperative multitasking, which keeps all the yummy things. And yes, it seemingly performs slightly better in my case. And yes, I had to patch it slightly.

TL;DR: dropped C++ stackless coroutines in favor of stackful coroutines (cooperative stack switching). What a relief!

3 comments

I have only briefly looked through the libgo code base. It looks like they use boost::context, which is the only good stackful coroutine implementation I've come across. Not being familiar with your project, I'm slightly confused by the statement about "hundreds of thousands of short-lived micro-tasks". This is usually a no-go for stackful coroutines, as you would waste too much memory and incur a lot of overhead.

Regarding your debugging issues, I'd be surprised if this doesn't improve over the next year or two. Clang, afaik, isn't even fully compatible with the final version of coroutines yet. Microsoft has done a lot of work on the compiler itself. I'd assume that Visual Studio will likely ship improvements once they release VS2022(?). Of course, these are only guesses on my part.

Summing it up, it sounds to me like you suffered from the curse of being an early adopter. It would be interesting to see whether you'd have fewer issues once tooling and compiler support have improved enough.

I have an internal scheduler which prevents spawning too many of them. In any case, it's only stack and context allocation (and freeing), which is not that CPU-expensive at my rate (it does not show up much in the profiler). Also, in the worst cases multiple concurrent processes use a lot of virtual memory because of the stack allocations, but in fact not much resident memory.
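A rough sketch of the idea behind such a limit, assuming a single-threaded cooperative setting: an admission gate that lets at most `limit` tasks hold a stack at once and parks the rest. `SpawnGate` and its methods are made-up names for illustration, not libgo API and not my real scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical admission gate: at most `limit` tasks are live (own a stack) at once.
// Single-threaded sketch; a multi-threaded scheduler would need locking.
class SpawnGate {
public:
    explicit SpawnGate(std::size_t limit) : limit_(limit) { assert(limit > 0); }

    // `start` stands for "allocate a stack and begin the task".
    // It runs now if a slot is free, otherwise it is parked.
    void spawn(std::function<void()> start) {
        if (live_ < limit_) {
            ++live_;
            start();
        } else {
            pending_.push_back(std::move(start));
        }
    }

    // Call when a live task finishes; the freed slot goes to the oldest parked task.
    void on_finished() {
        assert(live_ > 0);
        --live_;
        if (!pending_.empty()) {
            std::function<void()> next = std::move(pending_.front());
            pending_.pop_front();
            ++live_;
            next();
        }
    }

    std::size_t live() const { return live_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t limit_;
    std::size_t live_ = 0;
    std::deque<std::function<void()>> pending_;
};
```

A parked task costs only a small closure, so hundreds of thousands of queued micro-tasks never turn into hundreds of thousands of stacks.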

> once compiler support has improved enough

I give it a minimum of 5 years. It's already been a few years since it landed in clang. I don't believe it will be fixed soon in gdb/lldb. You need to introduce many non-generic things: at least new stack-chaining debug information for proper call stacks, which is (and will be!) specific to the thread pool implementation, because otherwise it would have to be part of the standard and part of the compiler implementation, which is even worse. With local vars it's slightly easier, however.

It does not look like it's using boost::context. At least I never saw it at runtime. It's using its own asm routines to save/restore the context.

As far as I can tell, the developer is using Fibers on Windows and boost::context on all other operating systems. You can see that he has a forward declaration in "libgo/context/fcontext.h" and then links against the respective boost assembly files in the CMakeLists.txt.

Ah, correct. He's using the asm files for fcontext from boost.context. It looks like those are copied into his tree from boost during the build phase, and those files were the only ones I found. Thanks for pointing that out. I may need to test it on ARM one day; now I have peace of mind regarding ARM.

> chinese libgo

I believe you mean this one?

libgo -- a coroutine library and a parallel Programming Library

https://github.com/yyzybb537/libgo

(no information about the main contributor unfortunately)

yes

Well, they have been central to WinRT since the early days, and it was Microsoft's input that largely contributed to the design.

So more an issue of tooling than anything else.