by scottlamb
108 days ago
I don't think the author intended "code simplicity" as an end unto itself, but as a way to reduce cache pressure. He popped into the 2016 discussion [1] to say:

> Another benefit of this design overlooked is that individual cores may not ever need to read memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much much slower than cache.

I think this is wrong or at least overstated: if you're passing off fds and their associated (kernel- and/or user-side) buffers between cores, you can't run entirely in L1 or L2. And in general, I'd expect data to be responsible for much more cache pressure than code, so I'm skeptical of localizing the code at the expense of the data.

But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

[1] https://news.ycombinator.com/item?id=10874616
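To make the last point concrete, here is a minimal sketch of "several threads, still pinned to one core". It assumes Linux and Python's `os.sched_setaffinity`. The helper names and the three stage names are hypothetical and are not taken from the article.

```python
import os
import threading

def pin_current_thread(cpu):
    """Pin the calling thread to one CPU and return its resulting CPU set.

    Returns None where os.sched_setaffinity is unavailable (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling thread" on Linux
    return os.sched_getaffinity(0)

def run_stages_on_one_core(cpu, stage_names):
    """Start one thread per (hypothetical) stage, all pinned to the same CPU."""
    affinities = {}

    def stage(name):
        affinities[name] = pin_current_thread(cpu)

    threads = [threading.Thread(target=stage, args=(n,)) for n in stage_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return affinities

def first_allowed_cpu():
    """Pick a CPU this process may run on (cpu 0 may be off-limits in containers)."""
    if hasattr(os, "sched_getaffinity"):
        return min(os.sched_getaffinity(0))
    return 0

cpu = first_allowed_cpu()
affinities = run_stages_on_one_core(cpu, ["accept", "parse", "respond"])
print(affinities)
```

On Linux every stage reports the same one-CPU set, so the kernel can only ever run one of them at a time. The split adds no parallelism, only the need to switch between the threads.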
|
(Mostly agreeing with you, I think.) Looking at the overall system and saying (handwave numbers) that 25% of system time is spent on accept and 75% on request handling, so let's dedicate 25% of the cores to accept and 75% to handling requests, is unfortunately the wrong way to split the work too. Each core would have a small userland loop, but communication between processes is expensive. You also get more kernel-side communication between processors than necessary, because the TCP state is touched by the processor handling the NIC queue the connection arrives on, by the processor handling the listen queue in userland, and then by the processor handling the request in userland. Setting up your system for high interprocessor communication limits the number of cores you can use effectively.
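A toy way to count this, as a sketch only: it just counts how many distinct cores touch one connection's state and measures nothing real. The 8-core/2-accept-core split and the "colocated" alternative (each core accepts and handles the connections whose packets it receives) are my assumptions for illustration, and the colocated case further assumes the NIC queue and the accepting thread actually line up on the same core.

```python
from itertools import product

def cores_touching(nic_core, accept_core, handler_core):
    """Number of distinct cores that touch one connection's TCP state."""
    return len({nic_core, accept_core, handler_core})

def split_design(n_cores, n_accept):
    """Dedicated accept cores [0, n_accept), handler cores [n_accept, n_cores).

    The NIC queue for a connection may land on any core, so enumerate
    every (NIC core, accept core, handler core) combination.
    """
    nic = range(n_cores)
    accept = range(n_accept)
    handlers = range(n_accept, n_cores)
    return [cores_touching(q, a, h) for q, a, h in product(nic, accept, handlers)]

def colocated_design(n_cores):
    """Assumed alternative: the same core receives, accepts, and handles."""
    return [cores_touching(c, c, c) for c in range(n_cores)]

split = split_design(n_cores=8, n_accept=2)
colocated = colocated_design(n_cores=8)

print("split design, cores per connection:", min(split), "to", max(split))
print("colocated design, cores per connection:", set(colocated))
```

In the split layout the accept core and the handler core are never the same, so every connection bounces between at least two cores, and a third whenever the NIC queue lands elsewhere.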