A conclusion reached by measurement is not premature. This looks like an attempt to write a better server than the 80/20 rule allows. If he's wrong and only one polling method is useful in production, the live servers will pick the good one and nobody will suffer because he jumped to conclusions. Since he's written Mongrel, I trust that he has a reason to worry about polling that may not have appeared in the post.
> A conclusion reached by measurement is not premature.
That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program.
By doing these tests in isolation, instead of while running on a profiling kernel under production load, it is quite possible that the bottleneck will not be the polling code at all but something entirely different. I'd say that this is a textbook example of what premature optimisation is all about.
Assuming you have a finite budget of time to spend on a project, any optimisations that take time out of that budget which could have been spent more effectively elsewhere are premature.
Now there is a chance that this would have been the bottleneck in the completed system, but before you've got a complete system you can't really tell. My guess based on real world experience with lots of system level code that used both, including web servers, video servers, streaming audio servers and so on is that the overhead of poll/epoll will be relatively minor compared to other parts of the code and the massive amount of IO that typically follows a poll or an epoll call.
If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, you'll then be doing IO on all of those for that single call to poll/epoll.
Each of those IO calls is probably going to be as much or more work to process than the poll call was.
Again with this idea that Mongrel2 isn't working. You sir have no freaking idea what you're talking about.
"That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program."
No, that's just plain wrong. Premature optimisation is actually implementing something convoluted thinking it's optimized without knowing whether it actually is or not. It's voodoo cargo cult science. It's going against Occam's razor.
There's nothing in Mongrel2 that's prematurely optimized. It's all very simple algorithms chosen for the right task, and later on I'll be testing them to see if they're still right. So your claim that this is premature optimization is just a buzzword and completely offensive. I took a long time to actually test my ideas before implementing them.
That's the total inverse of premature optimization.
What is total voodoo junk science is most of what you say. So far I haven't seen one set of data or any scientific experimental design or even a single testable hypothesis backing what you claim. It's all just rhetoric.
Until you've got hard numbers backing what you say, everything you're saying is inferior to what I'm doing: science.
So, if it's working, why not throw a load of real world traffic at it instead of this 'science' that you're performing here?
After all, that is where the rubber meets the road and it would be a very easy way to determine if your hunch is right or not.
Epoll was specifically created with that sort of workload in mind. Your 'surprising' conclusion is not rooted in epoll somehow behaving contrary to expectation; in fact it behaves exactly as it was designed to.
Benchmarking it like this is nothing like the real world, and that's where epoll shines, not when you test it the way you just did.
As for the numbers, we're serving about 10Gbps continuously using a combination of varnish and java code to several million uniques daily: html, images, video. Poll versus epoll is a settled race; as far as I'm concerned you're wasting your time with this.
But by all means, ignore all this and do what you have to, those are the lessons learned best anyway, and it's your time, not mine.
If you feel like getting another view on this I'd suggest contacting the author of Varnish; he really knows his stuff and he might be able to convince you where I cannot.
Because "real world traffic" is a bullshit test. Whose real world traffic? Mine? Yours? Google's? Yahoo's? Science is repeatable, which means you can replicate the same inputs to the test every time. The lab experiments involved in science are often highly sanitized and not "real world" at all. Determining what the results mean in the real world comes later.
If you're going to complain about science, at least understand how it works.
There are basically three sorts of traffic that I have experience with and would expect to be the major portion of whatever goes over the web:
- background (behind-the-scenes) ajax requests
- regular website content (images, dynamic html, css and other relatively small (say < 250K) files)
- media servers (filedumps, video servers, streaming audio servers)
Each of those requires fairly specific tuning of the TCP stack to get the most out of it, so you're not likely going to find all of these on one and the same machine unless it is a small operation (and in that case this whole discussion is moot).
A benchmark done in isolation is meaningless because in the end, real world traffic is what it is all about. So, I personally don't care whose site(s) you test with, as long as there are enough of them to get a statistically valid result.
Google's or Yahoo's would be fine with me, I've given my results above, if I have the time I'll do the same thing on a couple of other high volume sites.
I've (unfortunately) studied this problem quite a bit because of the size of the websites that I'm involved with, and so far I've learned that you can play around on your testbench all day long, but it doesn't matter one little bit for production purposes unless you are very careful (such as in that other test linked from this page) to simulate users' clients.
You could do a lot worse than to play back a log file in order to make an experiment repeatable. I assume that real world performance is what Zed is after, not theoretical performance.
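A rough sketch of what "play back a log file" could look like, assuming the log is in Common Log Format and that the caller supplies its own `send` function. The names `parse_log` and `replay` are hypothetical, and nothing here comes from Mongrel2 itself. The point is only that the same file yields the same request sequence on every run:

```python
import re
from datetime import datetime

# Common Log Format, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
CLF = re.compile(r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*"')


def parse_log(lines):
    """Yield (seconds_since_first_request, method, path) for each parsable line."""
    start = None
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue  # skip malformed lines rather than guess at their meaning
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        if start is None:
            start = ts
        yield (ts - start).total_seconds(), m["method"], m["path"]


def replay(lines, send, sleep=None):
    """Feed the recorded requests to `send` in order; `sleep` paces them if given."""
    clock = 0.0
    for offset, method, path in parse_log(lines):
        if sleep is not None and offset > clock:
            sleep(offset - clock)  # reproduce the recorded gap between requests
        clock = offset
        send(method, path)
```

Passing `sleep=time.sleep` would keep the original pacing, while leaving it out replays the requests as fast as `send` allows.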
> Because "real world traffic" is a bullshit test [..]
The hell it is. A statistically significant sample of 'real world ...' is the foundation for most engineering decisions. When you build a bridge, you take the actual loads it has to support into account. Intel bases their chipbaking on the actual purity of the silicon their suppliers can provide.
Exactly, there's no concept of confounding at all. You use "real world tests" (whatever the hell that is) when you have an actual specific setup to test. You use a small model experiment like this to test one specific thing like poll vs. epoll.
I think what he's saying is that it should be pretty easy to change between poll, epoll, and "superpoll", as you can easily abstract a common interface. Then, you could worry about the performance of that particular bit only when you encounter it, and use your time arguably more effectively on the actual features needed to productionize Mongrel2 better.
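To make the "common interface" idea concrete, here is a minimal sketch using Python's standard `selectors` module rather than Mongrel2's actual C code (that substitution is an assumption made purely for illustration). The helper names `ready_fds` and `available_backends` are hypothetical. The calling code stays the same whichever backend is plugged in:

```python
import selectors
import socket


def ready_fds(selector_cls, socks, timeout=0.1):
    """Return the sorted fds that are readable, using whichever backend is given."""
    with selector_cls() as sel:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return sorted(key.fd for key, _events in sel.select(timeout))


def available_backends():
    """Backends present on this platform (EpollSelector exists only on Linux)."""
    names = ("SelectSelector", "PollSelector", "EpollSelector")
    return [getattr(selectors, n) for n in names if hasattr(selectors, n)]


def demo():
    """Compare backends on one active and one idle socket."""
    a, b = socket.socketpair()  # b becomes readable once a sends
    c, d = socket.socketpair()  # d stays idle
    try:
        a.sendall(b"ping")
        return {cls.__name__: ready_fds(cls, [b, d]) for cls in available_backends()}
    finally:
        for s in (a, b, c, d):
            s.close()
```

Every backend should report only the fd of `b`, which is what lets the choice between them be deferred until measurements say it matters.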
I sort of agree with him, except with an important detail: this question is basically about prioritization of your time, and I'd say that this is nobody's business but yours! You can optimize memcpy() all day, for all I care. ;-)
There's one aspect where you'd be quite right to do some investigation before implementing: if the outcome changes the interface you'd need to implement.
For example, here's a hypothesis: using epoll's edge-triggered mode could drastically reduce the number of events (since you only get an event when an fd becomes readable/writable the first time, instead of every time it's in that state). Since epoll is O(N) on the number of events returned (not on the number of fds that are currently readable/writable), you'd lower the effective ATR a whole lot. In fact, a really busy server would have fewer events, since a readable fd would stay that way for longer if data is received at a great rate (the write-side story might be less brilliant). You'd also have to make far fewer calls to epoll_ctl, since you could just stop caring about the reading side while you're trying to write the last batch of data on the other side (no need to remove it from the interest set, you won't get events for it). You only need to set it when flipping from read to write, and the other way (after receiving/sending headers and bodies).
Now, if that's true, that's a big deal, because now you have to change your design a fair bit. You have to remember that an fd is readable until you get EAGAIN from read() yourself, so there's some more state management, moving that fd from one list to another, etc. Finding out that this would be a million times faster (or slower!) now would save you a ton of work, either way.
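A small sketch of the extra bookkeeping that edge-triggered mode implies, written in Python for brevity instead of C (an assumption for illustration; `drain` and `edge_triggered_demo` are hypothetical names). `select.epoll` and `EPOLLET` are only available on Linux. With `EPOLLET` the readable edge is reported once, so the application has to keep reading until it hits EAGAIN, which Python surfaces as `BlockingIOError`:

```python
import select
import socket


def drain(sock, chunk=4096):
    """Read until the kernel would block (EAGAIN) or the peer closes.

    Returns (data, peer_closed).
    """
    data = bytearray()
    while True:
        try:
            part = sock.recv(chunk)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: nothing left for now
            return bytes(data), False
        if not part:  # orderly shutdown by the peer
            return bytes(data), True
        data += part


def edge_triggered_demo(payload=b"x" * 8192):
    """Linux only: count events before and after the first report of an edge."""
    a, b = socket.socketpair()
    b.setblocking(False)
    ep = select.epoll()
    try:
        ep.register(b.fileno(), select.EPOLLIN | select.EPOLLET)
        a.sendall(payload)
        first = ep.poll(1.0)      # the readable edge is reported here
        second = ep.poll(0)       # data is still unread, but there is no new edge
        data, _closed = drain(b)  # so the application must drain it itself
        return len(first), len(second), len(data)
    finally:
        ep.close()
        a.close()
        b.close()
```

The second `poll` coming back empty while data is still buffered is exactly the state the server would have to track on its own.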
But finding whether poll or epoll is faster, or a hybrid solution with the same interface? Meh, it could wait.
(about my hypothesis, that's actually what Davide Libenzi designed epoll for, which might explain some of the weirder bits)
"I need polling" => "Here are my options, which one is better?" => "They're good for different things" => "I'll pick the best one for the environment" is a reasonable design process. More so than some decisions that I make! Yes, there's a fixed time budget. But you're suggesting selectively ignoring evidence when designing a program, preferring random guessing and pattern matching to actual numbers. Should he have collected them to begin with? Maybe not, but sometimes you can't help your curiosity on a hobby project :). I understand your concern about this hypothetical production system, but the fact of the matter is that there is no production system right now, and no way to measure how it will handle certain things, but there are benchmark numbers. Better than nothing, I say!
Which, in reality is, "I'll spend a lot of design and implementation effort designing a new one which may or may not improve the measurable, global performance of my new web server because it's not yet at the point where I can benchmark these sorts of things to verify that I'm not wasting a whole ton of effort that could be better spent by deciding that epoll is fast enough."
Maybe Zed knows from his previous server experience that {e}poll is where he hits a bottleneck; it's just that if there's any chance that it's not, he could be wasting a bunch of time implementing "superpoll".
(Or maybe he just wants to do it because it's neat, or because it's innovative (which it is), or for any number of other reasons. I'm just pointing out that he's doing much more than picking "the best one for the environment")
It's an idea I had after actually measuring. If it doesn't work then I tried something out.
What you really should be getting from it though is that epoll is not faster. It is not O(1). It is not faster on smaller vs. larger lists of FDs. Pretty much all the things you were told as advantages of epoll are total crap.
The only advantage of epoll is it's O(N=active) when poll is O(N=total). That's it.
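As a toy cost model of that statement (the per-item constants are hypothetical placeholders, not measured values, and the function names are made up for this sketch): poll's cost grows with the total number of fds, epoll's with the number of active ones, so which is cheaper depends on the active/total ratio rather than on the list size.

```python
def poll_cost(total, active, per_fd=1.0):
    """poll examines every registered fd on each call: O(total)."""
    return per_fd * total


def epoll_cost(total, active, per_event=1.0):
    """epoll_wait only walks the ready list: O(active)."""
    return per_event * active


def break_even_ratio(per_fd, per_event):
    """Active/total ratio above which poll is the cheaper call in this model."""
    return per_fd / per_event


# With a hypothetical per-event cost twice the per-fd cost, the two calls cost
# the same once half the fds are active, whether there are 100 fds or 10,000.
for total in (100, 10_000):
    active = int(total * break_even_ratio(1.0, 2.0))
    assert poll_cost(total, active, 1.0) == epoll_cost(total, active, 2.0)
```

Real per-call constants would have to come from measurement; the model only shows why the ratio is the quantity to measure.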
So at a minimum I've done some education and spent some time learning something.
The funny thing is that he's either going to pick the wrong solution for the workload or spend a lot of time on creating a hybrid which will work as well as epoll would without augmentation.
Zed is clearly out to change the world and I would very much like him to succeed but he just seems to be missing the obvious here, which is that idle connections are the ones to worry about (because they're very expensive!) so his benchmark at this point in time is useless.
First off, again you miss the point that I've already got poll working in Mongrel2. If this fails (which I'll know since I, like, measure stuff rather than carry a bottle of cheetah blood around like you) then I'll just go put epoll in there or leave it with poll.
But hey, if I don't make it back alive from my completely dangerous experiment in disproving that epoll is always the way to go, you can come get me. Bring a big gun because this stuff is so scawy and howwible I might not make it out alive.
"I need a filesystem" => "Here are my options, which one is better?" => "They're good for different things"...
STOP
For most things, it doesn't matter. A filesystem is a filesystem.
You only need to make decisions like that when you properly measure and decide that X may be a bottleneck, or you need features that Y has but X doesn't.
In nearly every large system I've been involved in designing, a "filesystem is a filesystem" attitude would have given you a great chance of rendering the system non-functional, or performing so badly it might as well be, when simple design decisions based on knowledge about the problem would have made that easy to avoid in the first place.
Premature optimization is a sin, but making design decisions you know from experience have a major impact is not, as long as you measure to confirm afterwards.
We don't design by throwing dice - most decisions we make are based on experience or assumptions about what works and what doesn't. Measurements are important to challenge those assumptions, but it doesn't take away the value of making use of experience to create a reasonable baseline.
Where "premature optimization" comes in is where you start expending unnecessary extra effort to implement a more complicated solution without evidence to back up the need for it.
Spending a little bit of time to think through the requirements for major aspects of your system is not extra effort.
> If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, ...<snip>
That's half true - it doesn't hold for low ATR traffic (lots of hanging connections, clients that GET something, spend time elsewhere while in the meantime the browser keeps the connection alive). In short, there's nothing typical about it because, while those two kinds of loads have been studied extensively in both bibliography and practice, their combination and the practical consequences are not well understood, afaik. Links to relevant studies are more than welcome, of course.
Large in an absolute sense, say 3K for a pool of 10K sockets. That's a sizeable number of active connections to deal with after a single system call. Typically for each of those fds you'll then do a read or a write. So the epoll/poll overhead is fairly small, with epoll coming out a bit faster than poll in that situation.
FWIW, I've written an async webserver which handles a few thousand concurrent users, and does thousands of HTTP reqs/second. I've never seen poll/epoll as a bottleneck, but I'm using Java NIO which really seems to work extremely well (I don't remember how/which it uses, I think epoll).
Maybe Java does some of this cool stuff already so perhaps I'm shielded from the pain of dealing with things directly.
In the past I've written Java NIO code that dealt with around 60,000 concurrent connections pretty well. The time spent doing poll seemed to be completely insignificant. CPU usage was negligible.
It'd be good to see some numbers though - for example:
For average mongrel application, 40% of CPU time is spent in poll / average of 30ms latency is due to poll etc.
But I'm skeptical those numbers are true. That was my point.
If you don't start with those numbers and measurements, optimizations like this, whilst interesting, may end up being of no real use to anyone.
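One way such numbers could be collected is to time each phase of the event loop separately. The sketch below is a hypothetical helper (`PhaseTimer` is not part of Mongrel2 or any library) that accumulates time per named phase and reports each phase's share:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate elapsed time per named phase of an event loop."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.totals[name] += self.clock() - start

    def shares(self):
        """Fraction of the measured time spent in each phase."""
        whole = sum(self.totals.values())
        return {k: v / whole for k, v in self.totals.items()} if whole else {}


# Intended use inside a loop (sel and handle are placeholders):
#     with timer.phase("poll"):
#         events = sel.select(timeout)
#     with timer.phase("io"):
#         for key, mask in events:
#             handle(key, mask)
```

With the default `time.perf_counter` clock the "poll" phase also includes time spent blocked waiting for events, so passing `clock=time.process_time` is probably closer to the "% of CPU time" figure asked for above.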
Looks like he's already written a large bulk of his server's code, so maybe this optimization isn't really premature :)
You're probably right that when you actually use Mongrel2 as your app server your app-specific code higher up will be a larger bottleneck, but that's code that you have to deal with and this is code that he has to deal with so optimizing the hell out of it doesn't sound like a bad idea.
Let's say 95% of time is in your code, and 5% is in Mongrel2.
Let's say that within Mongrel2, 10% of time is in this poll/epoll stuff.
That's 0.5% of your total time being spent here. So even if it's made twice as fast, your app will only speed up by say 100ms -> 99.75ms
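The arithmetic above is an instance of Amdahl's law, and can be checked with a few lines (the function name is made up for this sketch; the 5%, 10% and 2x figures are the ones assumed in the comment):

```python
def latency_after_speedup(total_ms, fraction, factor):
    """Amdahl-style estimate: only `fraction` of the time gets `factor` times faster."""
    return total_ms * ((1.0 - fraction) + fraction / factor)


poll_share = 0.05 * 0.10  # 5% of time in Mongrel2, 10% of that in poll/epoll
print(round(latency_after_speedup(100.0, poll_share, 2.0), 2))  # 99.75
```

Even an infinitely fast poll would only bring the 100ms request down to 99.5ms under these assumptions.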
Find the big things that matter and optimize them. Adding extra complexity to small things that don't matter is a recipe for more bugs and more issues.
I haven't actually used Mongrel2 yet, but I get the impression you can use it as a front-end for multiple physical servers, in which case the relative load of the M2 process on its server increases. Besides, unless Zed is writing your application and this is eating away at his time for that, it's not clear what exactly your complaint is.
Any complexity introduced in the code increases its long term maintenance cost. I strongly suspect this is one of those cases where the performance gain will not justify the long-term effort of maintaining a more complex architecture.
I remember having a similar discussion circa 1992 about advantages and disadvantages of using ODBC versus native MS SQL/Sybase libraries. I instrumented the program I was writing and showed it spent 99+% of the time idling, less than 1% of the time computing and, of that time, about 78% waiting for the database to return something. Using native libraries would yield a minuscule improvement at the cost of a huge headache.
So, you're comparing my track record with writing simple maintainable well documented code to something you did in 1992 with ODBC? That's your experience that's causing all the paranoia?
The two worst things that afflict programmers today are:
1. They never update their information, even after 18 years (18! You realize that right!? Things change man!)
2. They have an irrational paranoia about trying new things, as if me trying this out is going to destroy the universe.
1) Will you maintain Mongrel forever? It's not your track record that's the question, but the one of all future maintainers of Mongrel that will have to deal with the added complexity this change creates.
2) The experience from 1992 still seems current. Adding complexity to any software project adds cost to maintain it in the future. My experience in 1992 showed how added complexity for a marginal performance gain did not pay off then and still won't pay off today (unless you are programming a computer so expensive even a marginal increase in performance means lots of money).
3) It's your project and you may do with it whatever pleases you. What I wrote was intended as friendly advice from someone who has been in this business for a long time. You are, of course, free not to accept the advice.
4) I encourage you to try new things and I am usually the first to propose workload-adaptable solutions. I, however, had my share of extremely clever optimizations that bit me back later when things as subtle as processor caches changed and it's not very funny (albeit it is fun to dig deep in the system to find out why X runs 33% slower on the 50% faster box). Nowadays, I consider every program line not written a line gained.
> Is the extra complexity and logic really going to be a net win?
Fortunately, Zed is the right guy to find this out. I'm certainly looking forward to the results of this--which I bet we'll have an initial answer to by tomorrow.