by zedshaw 5802 days ago
So let's take your assertions and take them apart:

> the ones on dial up and on congested lines will get you every time.

Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently

Again, you have no numbers on the active/total ratio in your server, so unless you do, this statement doesn't refute what I found. I've presented evidence showing that the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.
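
To make the "simple math" concrete, here is a rough cost model. It is not the benchmark from the article, just a sketch under stated assumptions: poll is assumed to cost a fixed amount per watched FD, epoll a larger fixed amount per active FD. The function names and the per-FD constants are made up for illustration; the 1/0.6 constant is simply what a 0.6 break-even would imply under this model.

```python
def poll_cost(total, per_fd=1.0):
    """Assumed model: poll does work proportional to every watched FD."""
    return per_fd * total


def epoll_cost(active, per_active_fd=1.0 / 0.6):
    """Assumed model: epoll does work for ready FDs only, at a higher unit cost."""
    return per_active_fd * active


def breakeven_atr(per_fd, per_active_fd):
    """ATR where the costs match: per_fd * total == per_active_fd * active."""
    return per_fd / per_active_fd


if __name__ == "__main__":
    total = 1000
    for active in (100, 300, 600, 900):
        cheaper = "epoll" if epoll_cost(active) < poll_cost(total) else "poll"
        print(active / total, cheaper)
```

If epoll were as cheap per FD as poll (both constants equal), the break-even would sit at an ATR of 1.0 and epoll would never lose, which is the "as fast as poll all the time" case.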

> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.

But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:

> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
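
For reference, the rule quoted above is easy to state in code. This is a minimal sketch, not Mongrel2's actual logic: the helper names are hypothetical, the 0.6 threshold is the figure from the claim, and what happens at exactly 0.6 is an arbitrary choice since the claim doesn't cover it.

```python
def active_total_ratio(active, total):
    """ATR = active FDs / total FDs; treated as 0.0 when nothing is watched."""
    return active / total if total else 0.0


def pick_backend(active, total, threshold=0.6):
    """Return "epoll" below the threshold, "poll" at or above it.

    The tie at exactly `threshold` goes to poll here; that is an assumption.
    """
    if active_total_ratio(active, total) < threshold:
        return "epoll"
    return "poll"


if __name__ == "__main__":
    print(pick_backend(300, 1000))
    print(pick_backend(700, 1000))
```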


So, here are the numbers from one of the webservers that I instrumented to log the active-to-total ratio over a couple of hours.

The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high traffic website. Typically the requests are very short in nature (a 500 byte request followed by a < 10K answer).

After about two hours of running, the active-to-total ratio varied between 10% and 40% for 5 minute intervals, with the majority of the 5 minute buckets around the 30% mark. I'm actually quite surprised at the spread.

The bigger portion of the time seems to be spent waiting for the clients to send the request; most if not all of the output data should fit in the TCP output buffers. That actually skews the results upwards: for longer running requests sending more data to the clients, the active-to-total ratios would probably be a bit lower.

So 10% to 40% of all the sockets were active at any given time. The rest were idle, waiting for data to be received or for buffer space to be freed up so data could be written.
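
For anyone who wants to collect the same kind of numbers, here is a sketch of the bucketing step. It assumes the server logs (timestamp, active, total) samples somewhere; that sample format and the function name `bucket_atr` are illustrative assumptions, not how yawwws actually does it.

```python
from collections import defaultdict


def bucket_atr(samples, bucket_seconds=300):
    """Average the active-to-total ratio per time bucket.

    samples: iterable of (timestamp_seconds, active, total) tuples.
    Returns {bucket_start_seconds: mean ATR}, ordered by bucket start.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [ratio sum, count]
    for ts, active, total in samples:
        if total == 0:
            continue  # no sockets open, so the ratio is undefined; skip it
        start = int(ts // bucket_seconds) * bucket_seconds
        acc = sums[start]
        acc[0] += active / total
        acc[1] += 1
    return {start: s / n for start, (s, n) in sorted(sums.items())}


if __name__ == "__main__":
    demo = [(0, 10, 100), (100, 30, 100), (310, 40, 100)]
    for start, atr in bucket_atr(demo).items():
        print(start, round(atr, 2))
```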

In this situation epoll would be faster than poll because epoll only sends the user process those fds that it actually has to deal with rather than all of them, so the loop that takes the output of the system call will have fewer iterations.
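
The difference in loop length can be shown with a toy model. This only mimics the shape of the two result-handling loops as they look in C (a full pollfd array versus a list of ready events); it makes no system calls, and the function names are made up for the example.

```python
def poll_style_scan(revents):
    """revents: one entry per watched FD, nonzero when that FD is ready.

    Mirrors scanning a C pollfd array: every watched FD gets inspected.
    """
    visited, ready = 0, []
    for fd, ev in enumerate(revents):
        visited += 1
        if ev:
            ready.append(fd)
    return visited, ready


def epoll_style_scan(ready_events):
    """ready_events: (fd, event) pairs for ready FDs only.

    Mirrors walking the array filled in by epoll_wait: idle FDs never appear.
    """
    visited, ready = 0, []
    for fd, _ev in ready_events:
        visited += 1
        ready.append(fd)
    return visited, ready


if __name__ == "__main__":
    # 1000 watched FDs, 30% of them ready (an ATR of 0.3)
    revents = [1 if i % 10 < 3 else 0 for i in range(1000)]
    ready_events = [(fd, ev) for fd, ev in enumerate(revents) if ev]
    print(poll_style_scan(revents)[0], epoll_style_scan(ready_events)[0])
```

Both loops end up with the same set of ready FDs; only the number of entries visited differs.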

So, as I wrote before, I think the typical web server, when it is dealing with the client facing side, is more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.

This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.

Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.

If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.

Read "on dial-up" as "slow". The argument depends only on there being a certain distribution of client speeds. It's not about dial-up in particular.
And, if there's a distribution of speeds then you can measure the distribution and see what works best. Again, my challenge still stands:

Measure it or STFU.

You've already got a benchmarking system configured; why not benchmark using a congested (or at least, emulated) pipe?

It doesn't necessarily depend on dial-up, either. Imagine the number of people who leave bittorrent open in the background, stream porn, or whatever else that leaves their individual HTTP connections slow. Hell, latency alone (it takes at least a second for my connection to reach the east coast of the USA) would have an effect, and you can't underestimate the increasing number of mobile devices on slow(-ish, depending on congestion) 3G networks.

I'd provide statistics from my server (I serve an NZ gaming community), but I suspect my numbers would be disproportionate compared to the average workload. Here in NZ, we have far more people on crappy pipes (our DSL network is, famously, a gigantic pile of shit - although that has improved over the past couple of years and continues to), and far fewer people on smartphones (iPhones cost ~$800USD here).

Still, I believe the commenter has a point which you shouldn't ignore, or at least shouldn't pass off so easily :). I'd love to do some testing myself, but unfortunately between working a day-job, and spending my evenings trying to get a startup off the ground, I've got no time spare.

How about you get measurements of ATR from real-world deployments instead of the wild conjectures you've laid out in this thread? Your challenge applies even more so to yourself:

Measure it or GTFO.

Oh, you mean do what I'm already doing? Measuring and developing ideas then testing them?

It helps if you're going to comment that you actually read the words I use, not the ones you have in your head that make you sound like you're super smart.

Yes, you measured the ideal ATR inflection point for poll vs. epoll in your synthetic microbenchmark.

But you guessed wildly about what ATRs people see in the real world: http://news.ycombinator.com/item?id=1572292 http://news.ycombinator.com/item?id=1572418

Yeah, 'cause there's no way I'll be able to test a real web server that I actually wrote based on this small test. This is a small test to test one specific thing, doing more would confound the test. Confounding. Look it up.

Incidentally, this is the same test everyone else uses, so if you thought it was bullshit why did you support it when people were using it to test epoll? Oh, because they used it to confirm your bias rather than disagree with it.

If only you practiced what you preached. Imagine the amount of self-righteous bile that your servers wouldn't serve.
I am practicing what I preach, you stupid 20%-er troll. That's why the whole blog post is full of measurements, testing hypotheses, and the assumption that I might be wrong.

Because unlike you, I actually go do shit rather than spout off in a comment thread.

You may think you are. Far out. Why you think you can violate basic social norms, while others apparently support it, is beyond me.
Basic social norms also involve not saying ruthless comments you wouldn't say to my face.
> Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.

I agree, but "high enough" is apparently just 60%. The standing question is, what's the actual level in different kinds of servers?

In other words, I've given a metric: ATR at 60% is the break even point for poll vs. epoll. So far the only responses I've got haven't even tried to give out a metric, let alone say what their actual ATR is, but they claim that it's low.

I'm a scientist, so in the same way I don't believe my own research, I don't believe their rhetoric.

> I'm a scientist

I've never come across a scientist that took criticism of their work the way you do and that responded in the way you do. Shouting down, deriding, insulting and in general being a jerk to those that don't agree with you because 'you're a scientist' is not the way of science.

I dunno, that is how a lot of scientists I know interact with people. :/
That's pretty sad.
I'm shouting you down because you're a FUD slinging troll. Very first thing you did was immediately reply to every branch of the comments with your agenda. I actually have no idea what your problem is, since I'm just presenting some information and working on my own software with it, but you've got some weird "epoll religion" you like to spread.

So, consider me the Richard Dawkins of epoll.

Zed, you might reconsider your social strategy. Your article lays out a brilliant theoretical approach, but it's hard to figure out how to apply it without more real-world data. Pragmatically, it seems like your goal should be to encourage others who are in a position to gather this data to do so, and to share it with you.

It looks like Jacques has a pretty good start at making these measurements: http://news.ycombinator.com/item?id=1573145. If the numbers he provides aren't helpful, or aren't complete, you might try encouraging him to fix them. Calling him a "FUD slinging troll" seems more likely to cause him to tune out and ignore you, to the detriment of us all.

Realize that you've been thinking about this problem for a while now, while others have just started their thought process. Your goal is to get them up to speed so they can move your argument further, but this won't be instantaneous. Treating them as potential allies during this formative stage might pay off. If you can hold off with the insults for a couple hours or even days, you might get better results than if you shout them down immediately. :)

  Very first thing you did was immediately reply to every branch of the
  comments with your agenda.
You're suffering from paranoia. If you post an article about security, you can bet tptacek is all over the comments, informing and correcting people. In this case, the article was about something jacquesm happens to know a bit about, so he participates actively in the comments. To suggest he is pushing an 'agenda' is ridiculous: there's nothing at stake for him. The only thing he tries to do is help you, by noting that he thinks you have overlooked something.

  I actually have no idea what your problem is [..]
That's because he doesn't have a problem: it's your mind that's filling in the blanks. It suggests that while writing the article, you were already sure people would challenge you based on 'religious conviction' instead of on fact. jacquesm's point was a simple, critical question: what are actual real-world ATRs for the servers that Mongrel2 should be able to replace?

Allow me to make an observation of a psychological nature: you are thoroughly miffed that it was so easy for someone to provide possibly devastating criticism of an idea about which you started caring WAY too much. What you should realize is that nobody thinks less of you because of that criticism: the article is still interesting and provides a sound basis. There is no reason to react in such an aggressive way; it's even counterproductive.