by n_e 53 days ago
> Why is reserving a megabyte of stack space "expensive"?

Because if you use one thread for each of your 10,000 idle sockets you will use 10GB to do nothing.

So you'll want to use a better architecture such as a thread pool.

And if you want your better architecture to be generic and ergonomic, you'll end up with async or green threads.

> Because if you use one thread for each of your 10,000 idle sockets you will use 10GB to do nothing.

1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed.

2. It's not 10GB of RAM anyway, it's 10GB of address space. It still only gets faulted into real RAM when it gets used.
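
The difference between reserved address space and resident memory can be observed directly. What follows is a minimal sketch, assuming Linux with glibc; the helper names `resident_pages` and `reserve_and_touch` are made up for illustration. It maps a large anonymous region, writes to only a few pages, and asks `mincore` how many pages are actually resident.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr + len) that mincore() reports as resident.
 * addr must be page-aligned (mmap results are). Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t n = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(n);
    if (vec == NULL)
        return -1;
    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < n; i++)
            count += vec[i] & 1;      /* low bit set => page is resident */
    }
    free(vec);
    return count;
}

/* Hypothetical helper: reserve len bytes of address space, write to the
 * first `touched` pages (the caller keeps touched * page size <= len), and
 * report the resident page count before and after. Returns 0 on success. */
int reserve_and_touch(size_t len, size_t touched, long *before, long *after)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *before = resident_pages(p, len);
    for (size_t i = 0; i < touched; i++)
        p[i * (size_t)page] = 1;      /* first write faults the page in */
    *after = resident_pages(p, len);
    munmap(p, len);
    return 0;
}
```

Calling `reserve_and_touch(64u * 1024 * 1024, 3, &before, &after)` should report an `after` count far below the number of pages in the 64 MiB reservation. The exact figure depends on the kernel configuration; with transparent hugepages enabled, a single touch may make a whole 2 MB region resident.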

> 1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed.

My example (and the c10k problem) is 10k concurrent connections, not 10k concurrent requests.

> 2. It's not 10GB of RAM anyway, it's 10GB of address space. It still only gets faulted into real RAM when it gets used.

Yes, and that's both memory and cpu usage that isn't needed when using a better concurrency model. That's why no high-performance server software uses a huge number of threads, and many use the reactor pattern.
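
For readers unfamiliar with the reactor pattern, here is a minimal single-threaded sketch using POSIX `poll`. It is an illustration under assumptions, not how any particular server does it: pipes stand in for sockets, and the names `reactor_run_once`, `count_and_drain`, and `demo_idle_connections` are hypothetical.

```c
#include <poll.h>
#include <unistd.h>

#define MAX_CONNS 64

/* Callback invoked for each descriptor that has data to read. */
typedef void (*handler_fn)(int fd, void *ctx);

/* One iteration of a reactor loop: wait for readiness on all descriptors,
 * then dispatch the handler for each readable one. Returns the number of
 * handlers dispatched, 0 if nothing was ready, or -1 on error. */
int reactor_run_once(struct pollfd *fds, nfds_t n, int timeout_ms,
                     handler_fn on_readable, void *ctx)
{
    int ready = poll(fds, n, timeout_ms);
    if (ready <= 0)
        return ready;
    int dispatched = 0;
    for (nfds_t i = 0; i < n; i++) {
        if (fds[i].revents & POLLIN) {
            on_readable(fds[i].fd, ctx);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example handler: read one byte and count the event. */
void count_and_drain(int fd, void *ctx)
{
    char byte;
    if (read(fd, &byte, 1) == 1)
        (*(int *)ctx)++;
}

/* Demo: `total` pipes stand in for connections, only `active` of them
 * receive data. Returns the number of handled events, or -1 on error. */
int demo_idle_connections(int total, int active)
{
    struct pollfd fds[MAX_CONNS];
    int write_ends[MAX_CONNS];
    int handled = 0, opened = 0, result = -1;

    if (total < 0 || total > MAX_CONNS || active < 0 || active > total)
        return -1;
    for (; opened < total; opened++) {
        int p[2];
        if (pipe(p) != 0)
            goto cleanup;
        fds[opened].fd = p[0];
        fds[opened].events = POLLIN;
        fds[opened].revents = 0;
        write_ends[opened] = p[1];
    }
    for (int i = 0; i < active; i++)
        if (write(write_ends[i], "x", 1) != 1)
            goto cleanup;
    if (reactor_run_once(fds, (nfds_t)total, 0, count_and_drain, &handled) >= 0)
        result = handled;
cleanup:
    for (int i = 0; i < opened; i++) {
        close(fds[i].fd);
        close(write_ends[i]);
    }
    return result;
}
```

`demo_idle_connections(32, 3)` returns 3: the 29 idle descriptors cost one `pollfd` entry each instead of a thread and a stack. Production reactors typically use `epoll` or `kqueue` rather than `poll`, but the dispatch structure is the same.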

> Yes, and that's both memory and cpu usage that isn't needed

No, it literally is not. The "memory" is just entries in a page table in the kernel and MMU. It shouldn't worry you at all.

Nor is the CPU used by the kernel to manage those threads necessarily going to be less efficient than someone's handrolled async runtime. In fact, given that the kernel gets more eyes, it's likely more efficient.

The sole argument I can see is avoiding a handful of syscalls and excessive crossings of the kernel<->userspace blood-brain barrier.

> > Yes, and that's both memory and cpu usage that isn't needed
> No, it literally is not. The "memory" is just entries in a page table in the kernel and MMU. It shouldn't worry you at all.

Only if you never free one of those stacks. TLB flushes can be quite expensive.

Fair enough, though it's not like an async task runner doesn't have its own, often relatively expensive, book-keeping.
> 1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed

I've written massively concurrent systems where each connection only handled maybe a few kilobytes of data.

Async io is a massive win in those situations.

This describes many REST endpoints. Fetch a few rows from a DB, return some JSON.

> you will use 10GB to do nothing.

You don't pay for stack space you don't use unless you disable overcommit. And if you disable overcommit on modern Linux the machine will very quickly stop functioning.
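
To check which overcommit policy a Linux machine is running, the setting can be read from procfs. This is a small sketch assuming Linux; the function names `read_overcommit_mode` and `describe_overcommit` are hypothetical, while the path and the meaning of values 0, 1, and 2 come from the kernel's `vm.overcommit_memory` sysctl.

```c
#include <stdio.h>

/* Read vm.overcommit_memory from procfs.
 * Returns 0, 1, or 2 on success, or -1 if the file cannot be read
 * (for example on a non-Linux system). */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL)
        return -1;
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}

/* Map the numeric mode to a short description. */
const char *describe_overcommit(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit (the default)";
    case 1:  return "always overcommit";
    case 2:  return "strict accounting (overcommit disabled)";
    default: return "unknown";
    }
}
```

Only in mode 2 is untouched stack reserved against the commit limit; in modes 0 and 1 the kernel hands out address space without backing it until pages are touched.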

The amount of stack you pay for on a thread is proportional to the maximum depth that the stack ever reached on the thread. Operating systems can grow the amount of real memory allocated to a thread, but never shrink it.

It’s a programming model that has some really risky drawbacks.

> Operating systems can grow the amount of real memory allocated to a thread, but never shrink it.

Operating systems can shrink the memory usage of a stack.

  madvise(page, size, MADV_DONTNEED);
Leaves the memory mapping intact but the kernel frees underlying resources. Subsequent accesses get either new zero pages or the original file's pages.
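
A self-contained way to see this behaviour, as a sketch assuming Linux and a private anonymous mapping (the function name `dontneed_zeroes_page` is made up for the example):

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty four anonymous pages, release the upper two with MADV_DONTNEED,
 * then read them again. Returns 1 if the released pages read back as zero
 * while the untouched pages keep their contents, 0 if not, -1 on error. */
int dontneed_zeroes_page(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 4 * page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);             /* fault in and dirty all four pages */

    int result = -1;
    /* The mapping stays valid; only the backing pages are dropped. */
    if (madvise(p + 2 * page, 2 * page, MADV_DONTNEED) == 0)
        result = (p[0] == 0xAB && p[2 * page] == 0 && p[len - 1] == 0);

    munmap(p, len);
    return result;
}
```

On Linux this returns 1: the first two pages still hold `0xAB`, and the released pages come back zero-filled on the next access.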

Linux also supports mremap, which is essentially a kernel version of realloc. Supports growing and shrinking memory mappings.

  stack = mremap(stack, old_size, old_size / 2, MREMAP_MAYMOVE, 0);
Whether existing systems make use of this is another matter entirely. My language uses mremap for growth and shrinkage of stacks. C programs can't do it because pointers to stack allocated objects may exist.
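
For illustration, a runnable sketch of shrinking a mapping with `mremap`, assuming Linux and glibc. The mapping here is a plain anonymous region standing in for a runtime-managed stack, and `shrink_mapping_keeps_data` is a hypothetical name. Note that shrinking keeps the start address and trims the high end of the mapping.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map eight pages, store data at the start, shrink the mapping to four
 * pages, and check the data survived. Returns 1 on success, 0 if the data
 * was lost, -1 on error. */
int shrink_mapping_keeps_data(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t old_size = 8 * page;
    char *stack = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    strcpy(stack, "live data");

    /* Keep the old pointer until success is known: on failure mremap
     * returns MAP_FAILED and the original mapping is left in place. */
    char *shrunk = mremap(stack, old_size, old_size / 2, MREMAP_MAYMOVE);
    if (shrunk == MAP_FAILED) {
        munmap(stack, old_size);
        return -1;
    }

    int ok = strcmp(shrunk, "live data") == 0;
    munmap(shrunk, old_size / 2);
    return ok;
}
```

The fifth argument to `mremap` is only consulted when `MREMAP_FIXED` is set, so the sketch omits it.
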
> C programs can't do it because pointers to stack allocated objects may exist.

They sure shouldn't exist to the unused region of the stack though; if they do, that's a bug (because anything could claim that memory now). You should be free and clear to release stack pages past your current stack pointer.

High level languages have entire runtime systems dedicated to managing resources like that. My language can allocate, grow, shrink and deallocate stacks dynamically. It has complete visibility into everything, and the stacks themselves are designed to be relocatable and position-independent.

In C it's impossible to even get the stack pointer without dropping to assembly or using compiler builtins. It's hard to know where the stack starts or even how big it is.
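
As a sketch of what "compiler builtins" looks like in practice, the following assumes GCC or Clang on Linux with glibc. `__builtin_frame_address` is a compiler extension and `pthread_getattr_np` is a non-portable glibc extension; the wrapper names `approx_stack_pointer` and `current_stack_bounds` are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

/* Approximate the current stack position using a GCC/Clang builtin.
 * The value is the frame address of this helper, so it is only a rough
 * indication of where the caller's stack pointer is. */
__attribute__((noinline)) void *approx_stack_pointer(void)
{
    return __builtin_frame_address(0);
}

/* Query the calling thread's stack region: its lowest address and size.
 * Returns 0 on success or an errno-style code on failure. */
int current_stack_bounds(void **lowest, size_t *size)
{
    pthread_attr_t attr;
    int rc = pthread_getattr_np(pthread_self(), &attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_getstack(&attr, lowest, size);
    pthread_attr_destroy(&attr);
    return rc;
}
```

Neither call is standard C, which is the point being made above: the language itself gives no way to ask where the stack is or how large it is.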

I do agree with this, but just to be clear (for others), you don't need any runtime managing resource lifecycles to know that there shouldn't be pointers into free memory, such as the currently unused portion of the stack.
There isn’t any operating system or compiler that does this today, and it probably isn’t worth it to pursue. Enlarging the stack via page fault is really expensive, so you would need really advanced heuristics to prevent repeatedly unmapping/remapping those pages.

The correct tool for a myriad of small tasks is coroutines / green threads / async tasks, so why spend any energy optimizing threads for that purpose instead of what they are already good at?

In the general case it's absolutely not worth it. In the context of "you want a large number of OS threads, and are willing to go to some effort", it's theoretically something you'd want to do; suppose the startup for a thread is measurably a high water mark for stack usage, after startup the steady state stack usage won't exceed 20% of that high mark, and you'd like as many threads/stacks as possible.

Coroutines / green threads / async tasks will all do this too, but there's something to be said for using/relying on the system scheduler instead of bringing your own in addition.

Stack memory is never unmapped until the thread terminates as far as I know. I don’t know of any kernel that does this, for precisely the reason you arrive at by the very last sentence.
It's just normal pages to the kernel. In theory, it's totally possible for the program to munmap some of its own stack's pages if it was sophisticated enough. Typical C programs just aren't capable of it, at least not without great effort.
On a 64-bit system, 10 GB of address space is nothing.
10 GB of RAM is certainly something though. Especially in current times.
Except if those threads are actually faulting in all of that memory and making it resident, they'd be doing the same thing, just on the heap, for a classic async coroutine style application.
If you have hugepages enabled, all of those threads are probably faulting in a fair amount of memory.
Only if you've actually faulted in 2MB contiguously already.