Not if they're done as split/segmented stacks (http://gcc.gnu.org/wiki/SplitStacks). Basically, instead of one large up-front allocation, each thread gets a collection of 4 KB stack pages, and you grow it as needed. That costs a few instructions per function entry/exit, but the overall cost is negligible, and it lets you run thousands of coroutines without issues.
If you allocate the stacks contiguously using mmap, then physical memory is only used as pages are actually accessed, so that's not the problem. The problem is that 4000 concurrent non-trivial threads are a resource hog no matter how the stacks are allocated.
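
If you want to see the mmap behaviour for yourself, here is a rough sketch in Python. It assumes Linux (it reads resident memory from /proc/self/statm), and the function names and sizes are made up for illustration. It reserves a large private anonymous mapping the way a contiguous stack would be reserved, then touches only the top part of it, since stacks grow downward. The resident numbers reported by the kernel are approximate, so treat the output as a ballpark.

```python
import mmap

PAGE = mmap.PAGESIZE


def rss_bytes():
    """Approximate resident set size of this process (Linux only)."""
    # The second field of /proc/self/statm is the resident page count.
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * PAGE


def lazy_stack_demo(reserve=256 * 1024 * 1024, touch=16 * 1024 * 1024):
    """Reserve `reserve` bytes, touch only `touch` of them, report RSS growth.

    Hypothetical helper: returns (growth_after_map, growth_after_touch),
    both in bytes relative to the RSS measured before the mapping existed.
    """
    base = rss_bytes()

    # Private anonymous mapping, like a contiguously reserved thread stack.
    stack = mmap.mmap(
        -1,
        reserve,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    after_map = rss_bytes() - base

    # Write one byte per page, starting at the high end of the region,
    # to mimic a stack growing downward. Only these pages get faulted in.
    for off in range(reserve - PAGE, reserve - touch - 1, -PAGE):
        stack[off] = 1
    after_touch = rss_bytes() - base

    stack.close()
    return after_map, after_touch


if __name__ == "__main__":
    mapped, touched = lazy_stack_demo()
    print(f"RSS growth after mmap:  {mapped / 2**20:.1f} MiB")
    print(f"RSS growth after touch: {touched / 2**20:.1f} MiB")
```

On a typical Linux box the first number should stay close to zero and the second should land near the amount touched rather than the amount reserved. That only covers the stack memory, though, which is the point above: the scheduling and kernel bookkeeping for thousands of threads is a separate cost that this does nothing about.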