Since performance was a consideration for you, how does it compare to TBB or even, say, OpenMP? I've used those, as well as the task system that comes with ISPC, for HPC stuff, but I like the lightweight C11 approach here.
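I'm almost certain this will be slower than OpenMP, because it uses a single centralized task queue protected by a lock, so every worker contends on it. OpenMP runtimes tend to use decentralized work-stealing queues instead, the classic design being the Chase-Lev deque: each worker pushes and pops at the bottom of its own deque, and idle workers steal from the top of someone else's. There's a C11 implementation in this paper: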
https://fzn.fr/readings/ppopp13.pdf
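To give a rough idea of the shape, here's a minimal sketch of a Chase-Lev style deque in C11. It is not the paper's code: I'm assuming a fixed-capacity buffer (the paper's version grows the array), plain `int` task ids instead of task pointers, and default sequentially consistent atomics instead of the relaxed orderings and fences the paper works out. The names `deque_t`, `deque_push`, `deque_pop`, `deque_steal` and `CAP` are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CAP 256 /* hypothetical fixed capacity; the paper resizes instead */
static_assert((CAP & (CAP - 1)) == 0, "CAP must be a power of two");

typedef struct {
    atomic_long top;      /* thieves steal from this end */
    atomic_long bottom;   /* the owner pushes and pops at this end */
    _Atomic int buf[CAP]; /* int task ids stand in for real task pointers */
} deque_t;

static void deque_init(deque_t *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < CAP; i++) atomic_init(&q->buf[i], 0);
}

/* Owner only. Returns false if the deque is full. */
static bool deque_push(deque_t *q, int task) {
    long b = atomic_load(&q->bottom);
    long t = atomic_load(&q->top);
    if (b - t >= CAP) return false;
    atomic_store(&q->buf[b & (CAP - 1)], task);
    atomic_store(&q->bottom, b + 1); /* publish the new task */
    return true;
}

/* Owner only. Takes the most recently pushed task (LIFO). */
static bool deque_pop(deque_t *q, int *task) {
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);     /* reserve the bottom slot */
    long t = atomic_load(&q->top);
    if (t > b) {                     /* deque was empty: undo */
        atomic_store(&q->bottom, b + 1);
        return false;
    }
    int x = atomic_load(&q->buf[b & (CAP - 1)]);
    if (t == b) {                    /* last task: race any thief for it */
        bool won = atomic_compare_exchange_strong(&q->top, &t, t + 1);
        atomic_store(&q->bottom, b + 1);
        if (!won) return false;      /* a thief got there first */
    }
    *task = x;
    return true;
}

/* Any thread. Takes the oldest task (FIFO); may fail if it loses a race. */
static bool deque_steal(deque_t *q, int *task) {
    long t = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (t >= b) return false;        /* nothing to steal */
    int x = atomic_load(&q->buf[t & (CAP - 1)]);
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return false;                /* another thief or the owner won */
    *task = x;
    return true;
}
```

The point is that the owner's push and pop never take a lock, and owner and thieves only collide (on the `top` compare-and-swap) when the deque is down to its last task. With one deque per worker, that's where the win over a single locked queue would come from.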