by scottlamb 108 days ago
I don't think the author intended "code simplicity" as an end unto itself, but as a way to reduce cache pressure. He popped into the 2016 discussion [1] to say:

> Another benefit of this design overlooked is that individual cores may not ever need to read memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much much slower than cache.

I think this is wrong or at least overstated: if you're passing off fds and their associated (kernel- and/or user-side) buffers between cores, you can't run entirely in L1 or L2. And in general, I'd expect data to be responsible for much more cache pressure than code, so I'm skeptical of localizing the code at the expense of the data.

But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

[1] https://news.ycombinator.com/item?id=10874616

1 comments

> But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

(Mostly agreeing with you, I think.) Looking at the overall system and saying (handwave numbers) that 25% of system time is spent on accept and 75% on request handling, so let's dedicate 25% of cores to accept and 75% to handling requests, is unfortunately the wrong way to split the work too. Each core would have a small userland loop, but communication between processes is expensive. And you get more kernel-side communication between processors than necessary, because the TCP state will be touched by the processor handling the NIC queue the packets arrive on, the processor handling the listen queue in userland, and then the processor handling the request in userland. Setting up your system to have high interprocessor communication limits the number of cores you can effectively use.

Agreed. The best is going to be to use steering [1] and one pinned thread per core to keep each connection handled on one core as completely as possible.
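To make the "one pinned thread per core" half concrete, here is a rough sketch of the per-worker setup. It is only an illustration under some assumptions: it is Linux-oriented, the helper names `pin_to_core` and `make_listener` are made up for this example, and using SO_REUSEPORT to give each worker its own listening socket is my assumption rather than something stated above. The kernel-side steering described in [1] is system configuration and is not shown.

```python
import os
import socket


def pin_to_core(core):
    """Pin the caller to a single core; return False where that is not possible."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # platform has no affinity API exposed to Python
    try:
        # pid 0 targets the caller (on Linux, the calling thread).
        os.sched_setaffinity(0, {core})
    except OSError:
        return False  # e.g. the core is outside the allowed CPU set
    return True


def make_listener(port):
    """Open a listening socket that per-core siblings can bind to as well."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if hasattr(socket, "SO_REUSEPORT"):
        # Assumption: each worker gets its own accept queue on the same port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock
```

Each worker would call `pin_to_core(n)` once at startup and then run its own accept/handle loop on its own `make_listener(port)` socket, so nothing is handed between cores in userland. In Python specifically you would want one process per core rather than threads, since threads share the interpreter lock.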

...with the caveat that it makes load balancing much harder when each core is essentially an independent server. If you overload some cores, even briefly, your tail latency will really suffer. And if you decrease utilization to compensate for it, you've lost the efficiency advantage you were going for too. So the more conventional approach of a single multi-core reactor can be much better if you don't have a very good load-balancing story.

...another caveat: if you have some massive shared dataset (think search), the cache-efficient approach goes the total other way: each core should own some shard, and a single request should be fanned out across all of them.
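A toy sketch of that fan-out shape, just to show the structure. The function names and the tiny in-memory "dataset" are hypothetical, and the thread pool here only runs one task per shard; it does not pin anything, whereas a real design would keep a pinned worker permanently attached to each shard so the shard stays hot in that core's cache.

```python
from concurrent.futures import ThreadPoolExecutor


def build_shards(docs, n_shards):
    """Split the dataset into disjoint slices, one per worker."""
    return [docs[i::n_shards] for i in range(n_shards)]


def search_shard(shard, term):
    """Scan a single shard; this touches only that shard's data."""
    return [doc for doc in shard if term in doc]


def fan_out(shards, term):
    """Send the same query to every shard and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(search_shard, shards, [term] * len(shards))
        return sorted(hit for part in parts for hit in part)
```

For example, with `shards = build_shards(["cache miss", "l1 cache", "tcp state", "listen queue"], 2)`, calling `fan_out(shards, "cache")` queries both shards and merges the hits into one list.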

...so the best model may vary, but it's not the one in this article.

[1] https://www.kernel.org/doc/html/v5.1/networking/scaling.html