by epicprogrammer 108 days ago
It’s an interesting throwback to SEDA, but physically passing file descriptors between different cores as a connection changes state is usually a performance killer on modern hardware. While it sounds elegant on a whiteboard to have a dedicated 'accept' core and a 'read' core, you end up trading a slightly simpler state machine for massive L1/L2 cache thrashing. Every time you hand off that connection, you immediately invalidate the buffers and TCP state you just built up. There’s a reason the industry largely settled on shared-nothing architectures like NGINX: having a single pinned thread handle the entire lifecycle of a request keeps all that data strictly local to the CPU cache. When you're trying to scale, respecting data locality almost always beats pipeline cleanliness.
3 comments

You could presumably have an acceptor thread per core, which passes the fds to the core-aligned next thread, etc.

That would get you the code simplicity benefits the article suggests, while keeping the socket bound to a single core, which is definitely needed.

Depending on whether you actually need to share anything, you could do process per core, thread per loop, and then you have no core-to-core communication in the usual workings of the process (I/O may cross, though).

I don't think the author intended "code simplicity" as an end unto itself but a way to reduce cache pressure. He popped into the 2016 discussion [1] to say:

> Another benefit of this design overlooked is that individual cores may not ever need to read memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much much slower than cache.

I think this is wrong or at least overstated: if you're passing off fds and their associated (kernel- and/or user-side) buffers between cores, you can't run entirely in L1 or L2. And in general, I'd expect data to be responsible for much more cache pressure than code, so I'm skeptical of localizing the code at the expense of the data.

But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

[1] https://news.ycombinator.com/item?id=10874616

> But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

(Mostly agreeing with you, I think). I think looking at the overall system and saying (handwave numbers) 25% of system time is spent on accept and 75% on request handling, so let's set 25% of cores to accept and 75% to handle requests, is unfortunately the wrong way to split the work too. Each core would have a small userland loop, but communication between processes is expensive. And you have (more than necessary) kernel-side communication between processors too, because the TCP state will be touched by the processor handling the NIC queue it arrives on, as well as the processor handling the listen queue in userland and then the processor handling the request in userland. Setting up your system to have high interprocessor communication limits the number of cores you can effectively use.

Agreed. The best is going to be to use steering [1] and one pinned thread per core to keep each connection handled on one core as completely as possible.
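For what it's worth, the "one pinned thread per core" half of that can be sketched in a few lines of Python. This is only an illustration under assumptions: it is Linux-only (it relies on `os.sched_setaffinity`), the names `run_pinned_workers` and `report_affinity` are made up for the example, and it leaves out the NIC steering side entirely.

```python
import os
import threading

def run_pinned_workers(work):
    """Start one thread per allowed CPU, pin each to its CPU, run work(cpu)."""
    cpus = sorted(os.sched_getaffinity(0))  # CPUs this process may run on
    results = {}

    def worker(cpu):
        # On Linux, pid 0 refers to the calling thread, so only this
        # thread is pinned; the main thread keeps its original mask.
        os.sched_setaffinity(0, {cpu})
        results[cpu] = work(cpu)

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def report_affinity(cpu):
    """Hypothetical per-core work: report the affinity mask the thread sees."""
    return os.sched_getaffinity(0)
```

In a real server, `work` would be that core's whole event loop (accept, read, process, write), so a connection never leaves the core it landed on.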

...with the caveat that it makes load balancing much harder when each core is essentially an independent server. If you overload some cores, even briefly, your tail latency will really suffer. And if you decrease utilization to compensate for it, you've lost the efficiency advantage you were going for too. So the more conventional approach of a single multi-core reactor can be much better if you don't have a very good load-balancing story.

...another caveat: if you have some massive shared dataset (think search), the cache-efficient approach goes the total other way: each core should own some shard, and a single request should be fanned out across all of them.
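A toy sketch of that shard-and-fan-out shape, in Python. Everything here is hypothetical (the function names, the substring "search", the round-robin sharding), and Python threads say nothing about real cache behavior. It only shows the ownership pattern: each worker touches its own shard and nothing else, and every request visits all shards.

```python
from concurrent.futures import ThreadPoolExecutor

def build_shards(docs, n_shards):
    """Split the dataset round-robin so each shard has exactly one owner."""
    shards = [[] for _ in range(n_shards)]
    for i, doc in enumerate(docs):
        shards[i % n_shards].append(doc)
    return shards

def search_shard(shard, term):
    """Work done by one owner: scan only its own shard."""
    return [doc for doc in shard if term in doc]

def fan_out_search(shards, term):
    """Send one request to every shard and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(search_shard, shards, [term] * len(shards))
        return sorted(hit for part in partials for hit in part)
```

In a real system each shard's worker would be pinned to its own core, so the shard stays warm in that core's cache while requests come and go.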

...so the best model may vary, but it's not the one in this article.

[1] https://www.kernel.org/doc/html/v5.1/networking/scaling.html

Well, kernels have grown some support for steering accept() to a worker thread directly, for instance SO_REUSEPORT (Linux) and SO_REUSEPORT_LB (FreeBSD).
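A minimal sketch of the Linux variant, in Python. It assumes a platform where `socket.SO_REUSEPORT` is defined, and `open_reuseport_listeners` is a made-up helper name. Each worker gets its own listening socket on the same port and the kernel spreads incoming connections across them, so no userland acceptor has to hand fds around.

```python
import socket

def open_reuseport_listeners(n, host="127.0.0.1", port=0):
    """Open n listening sockets that all share one port via SO_REUSEPORT."""
    listeners = []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # The option must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen(128)
        # If port was 0, reuse the kernel-chosen port for the remaining sockets.
        port = s.getsockname()[1]
        listeners.append(s)
    return listeners
```

Each worker thread or process would then call `accept()` only on its own listener.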
While I agree that shared-nothing beats the pants off shared state performance-wise, surely the penalty you've outlined is only for super short-lived connections?

For longer-lived connections the cache is going to thrash on an inevitable context switch anyway (either due to needing to wait for more I/O or normal preemption). As long as processing of I/O is handled on a given core, I don't know if there is actually such a huge benefit. A single pinned thread for the entire lifecycle has the problem that you get latency bottlenecks under load, where two CPU-heavy requests end up contending for the same core, vs work stealing making use of available compute.

The ultimate benefit would be if you could arrange each core to be given a dedicated NIC. Then the interrupts for the NIC are arriving on the core that's processing each packet. But otherwise you're already going to have to wake up the NIC on a random core to do a cross-core delivery of the I/O data.

TLDR: It's super complex to get a truly shared-nothing approach unless you have a single application and you correctly allocate the work. It's really hard to find a generic solution that is optimal for all possible combinations of request and processing patterns.