You know, this [1] really ought to permanently put away the idea I think you're trying to reference, which is that only "event based" systems can be performant. There are plenty of "thread" or "process" based approaches that do quite well, including I believe the uppermost tier of every benchmark on that site. The idea that threads or processes are intrinsically slow was sheer unmitigated propaganda, and probably not only failed to contain a grain of truth, but was actively false. (Some thread implementations were slower than others, but that turns out to have been the implementations rather than the idea.) Event based systems inevitably have a lot of function calls in them, and that will probably in the end be slower than properly done threads or continuation-based approaches, always, because of that overhead.
People measure different things in different ways and then draw conclusions (or tweak measurement parameters until the results support their preconceived beliefs).
An event-based system can be more performant in some cases and slower in others. If there is not much opportunity for the CPU to do any work, then an event-based system will often outperform threads. One example is proxies. I already gave haproxy as an example, so I'll repeat it here as well. It is single-threaded and event-based by default. It is certainly performant. Why? Because in a simplified model it just shuffles data from one socket to another. Pretty straightforward. Introducing multiple threads and context switches might just thrash caches and actually make it worse (I have seen that happen).
Now add some CPU work in there. Say, make each connection compute something, serialize some JSON. Like in those benchmarks: they use a DB driver to get a row, serialize it and return it. OK, there is some work. Now it is more likely that multi-threading will help. But again, one can surely tweak CPU affinities, thread pool sizes, hyper-threading BIOS settings and DB driver types to really change things up. Threads take up memory. Not an insignificant amount. Now, I like green threads, Erlang's processes and Go's goroutines because they are lightweight. (At least Erlang's processes map N:M to CPUs for parallel execution on the host machine).
So I guess my point is that you are right: event-based systems are not always and strictly more performant. But I also think that in certain cases they can beat multi-threaded code (thread memory size, context switches, cache thrashing). That benchmark there, I wouldn't take it too seriously, just like I wouldn't take the Language Shootout too seriously.
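To make the "shuffles data from one socket to another" model concrete, here is a minimal sketch of a single-threaded readiness loop using Python's standard `selectors` module. It is not how haproxy is implemented; `run_relay` and its parameters are hypothetical names for illustration, and the write side is simplified to a blocking `sendall`, which a real proxy would not do.

```python
import selectors
import socket


def run_relay(pairs, total_bytes):
    """Forward bytes from each source socket to its paired sink, in one thread.

    pairs: list of (source, sink) sockets (hypothetical setup for this sketch).
    Returns the number of bytes forwarded.
    """
    sel = selectors.DefaultSelector()
    for src, dst in pairs:
        src.setblocking(False)
        # Remember the sink alongside the source so the loop can find it.
        sel.register(src, selectors.EVENT_READ, data=dst)

    forwarded = 0
    while forwarded < total_bytes and sel.get_map():
        events = sel.select(timeout=1.0)
        if not events:
            break  # nothing became readable within the timeout; give up
        for key, _mask in events:
            chunk = key.fileobj.recv(4096)
            if not chunk:
                # Peer closed the connection; stop watching this source.
                sel.unregister(key.fileobj)
                continue
            # Simplification: a real proxy would also wait for write readiness.
            key.data.sendall(chunk)
            forwarded += len(chunk)
    sel.close()
    return forwarded
```

The whole loop is one thread blocked in a single readiness call, which is why there is so little for extra threads to speed up in this case.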
The whole event-based dogma is that event-based systems are not merely performance-competitive, but performance-dominant. If they even tie, but also incur the extra development expense of significantly increased code complexity, they still lose. If the event-based systems can't stomp thread-based systems in a benchmark, they're unlikely to do it in the real world either, carrying around the extra baggage of complicated code... it's not like event-based code scales up gracefully in size as the problem size increases whereas the (modern [1]!) threading approaches explode in complexity, what with the truth being the exact opposite of that.
Taking benchmarks too seriously is a problem; dismissing them too cavalierly is a problem, too. Those benchmarks may reflect the truth to seven significant digits... but based on what I see in there, I suspect they reflect the truth to about one and a half digits.
I've got some event-based code I manage at work, because it was the best choice. But it wasn't the best choice because of performance, or code complexity, or any of the other putative advantages of event-based systems, it was the best choice due to the local language-use landscape pushing me into a language in which event-based systems are the only credible choice. You know that comment that "design patterns show a weakness in your language?" I don't 100% agree with that, but it's true here; event-based server loops are a sign of a weakness in your language, not a good idea.
[1]: Here defined to a first approximation as "shared little-to-nothing" threading models, rather than the old-school approaches that produced enormous program-state-space complexity.
I agree with you, hopefully you see that, but hopefully you can also see why for heavily IO-bound applications event-based systems (basically code woven around a giant epoll/select/poll/kqueue system call) can be faster.
Modern machines are different from those of 10-15 years ago. Caches and SMP topologies sometimes play serious roles in the outcome of a benchmark. Threads are often heavyweight memory-wise. That is why the C10K problem started to be solved better by event-based systems.
Even looking at your benchmarks link, I would say more of the ones at the top are actually event-based. The "cpoll" ones look event-based, centered around a polling loop. So is openresty -- which is a set of Lua modules working in nginx, also an evented server (but it is also mixed with a set of worker processes from what I understand).
And I like what you said about threaded ones being better even if performance is the same. Yes. Not only that, for me it is 10x. Even if threaded ones are 10x slower and that is tolerable, then I would rather pick that. Why? Because the code is clearer and matches better with the intuitive breakdown of a problem domain. That is why I like Erlang, Go, Rust and Akka -- actor models just model the world better (a single request is sequential, with clear steps that run one after another to process it, but there is concurrency between requests). An actor models that perfectly and I like that.
I also, like you, dealt with an evented promises/futures based system for years and it wasn't fun. It works great for little benchmarks and examples, but once it grows it becomes a set of tangled slinkies that only the original writer (me in this case) understands.
> The idea that threads or processes are intrinsically slow was sheer unmitigated propaganda, and probably not only failed to contain a grain of truth, but was actively false.
Threads / processes:
* Run some code from A
* Save state, context switch
* Run some code from B
* Save state, context switch
* Deal with locking, synchronisation, etc
vs
* Run some code.
There are absolutely no instances where [num threads] > [num cores] is as efficient as not using more threads than cores.
Funny, then, you'd think the benchmarks would show that, if it's so obvious, instead of showing the opposite.
The problem is that once you understand what lies behind your glib "run some code", you understand what the problem is. I mean, for one thing, the idea that in a busy server switching to a different event handler which has neither its code nor its data in any processor cache is not itself a "context switch" is a use of the term not necessarily connected to any reality, even if one might pass Computer Science 302 with that answer. Alas, we cannot convince our CPUs or RAM to go any faster by arguing at them that they aren't making a "context switch".
But, you know, it's an open benchmark, and the benchmarks themselves aren't all that complicated. Do feel free to submit your event-based handlers that blow the socks off the competition. Bearing in mind that is the standard you've set here. Merely competitive means you've still lost. Nor do I see any "but benchmarks don't mean anything" wiggle room in your statements, because what you're talking about is exactly what is being benchmarked.
The linked article didn't make this clear, but this feature is mainly designed for process-per-core models, not process-per-connection. The problem you run into with most existing process-per-core systems is that you can't ensure an even distribution of load across the processes without introducing extra overhead. SO_REUSEPORT offers some convenience when changing the number of processes, but the real benefit is that in this mode the kernel uses a better load-balancing scheme.
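For anyone who hasn't used it, a rough sketch of what the process-per-core setup looks like from the application side, assuming a platform where Python exposes `socket.SO_REUSEPORT` (Linux 3.9+ and the BSDs; it is not available on Windows). `make_reuseport_listener` is a hypothetical helper name, not part of any library.

```python
import socket


def make_reuseport_listener(host="127.0.0.1", port=0, backlog=128):
    """Create a listening TCP socket that shares its port with sibling sockets.

    Each worker process would call this with the same port and then run its
    own accept() loop; the kernel distributes incoming connections among them.
    """
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() on every socket sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock
```

In a real server you would start roughly one worker per core (for example `os.cpu_count()` processes), each owning its own listener, instead of having all workers contend on a single shared accept queue.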
To hide I/O latency. Without threads, you cannot do this effectively unless you implement your own scheduler or your I/O delays are constant and known a priori.
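A toy illustration of the latency-hiding point, with `time.sleep` standing in for a blocking I/O call. `fake_io` and `fetch_all` are made-up names for this sketch, and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for a blocking I/O call; the thread is idle while it waits."""
    time.sleep(delay)
    return delay


def fetch_all(delays, workers):
    """Run every fake I/O call on a thread pool and time the whole batch."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_io, delays))
    return results, time.monotonic() - start
```

With as many workers as requests, the batch should take roughly as long as the slowest call rather than the sum of all of them, even on a single core, because the threads spend their time blocked and the OS scheduler overlaps the waits for you.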
[1]: http://www.techempower.com/benchmarks/