by jacquesm 5797 days ago
In real-life web serving situations, and not in benchmarks, the majority of the fds is not active. It's the slow guys that kill you.

A client on a fast connection will come in and will pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall clock time and the number of times the 'poll' cycle is done is very small.

But the slowpokes, the ones on dial up and on congested lines will get you every time. They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently, first to see if they've finally completed sending you a request, then to see if they've finally received the last little bit of data that you sent them.

The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.


So let's take your assertions and take them apart:

> the ones on dial up and on congested lines will get you every time.

Do you have numbers on the dial-up users for your server? My understanding is that there are far fewer, so this is bogus. Show evidence of high dial-up penetration first.

> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently

Again, you have no numbers on the active/total ratio in your server, so unless you do, this statement doesn't refute what I found. I've presented evidence that just shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.
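
One way to read the O(active) versus O(total) argument is as a simple linear cost model. The sketch below is not taken from the article: it assumes poll costs a fixed amount per registered fd and epoll a fixed amount per active fd, and the function names and constants are hypothetical.

```python
def poll_cost(total_fds, cost_per_fd=1.0):
    """Assumed model: poll scans every registered fd on each call."""
    return cost_per_fd * total_fds


def epoll_cost(active_fds, cost_per_active_fd=1.0):
    """Assumed model: epoll only does work for fds that are ready."""
    return cost_per_active_fd * active_fds


def break_even_atr(poll_cost_per_fd, epoll_cost_per_active_fd):
    """Active/total ratio at which both modelled costs are equal.

    Solving  epoll_cost_per_active_fd * active == poll_cost_per_fd * total
    for active / total gives the ratio of the two per-fd costs.
    """
    return poll_cost_per_fd / epoll_cost_per_active_fd
```

Under this model, a break-even at 0.6 would mean that handling one active fd through epoll costs roughly 1/0.6 ≈ 1.67 times what poll spends per registered fd. Real syscall costs are unlikely to be this linear, so it is a way to frame the measurement, not a result.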

> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.

But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:

> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
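
The quoted rule can be written down as a small helper, which makes it easy to plug measured numbers in. This is only a sketch: the function names are hypothetical, the 0.6 default is the break-even claimed above, and the behaviour exactly at the threshold is an arbitrary choice because the claim does not cover it.

```python
def atr(active_fds, total_fds):
    """Active/total ratio (ATR) for one snapshot of the fd set."""
    if total_fds <= 0:
        raise ValueError("total_fds must be positive")
    if not 0 <= active_fds <= total_fds:
        raise ValueError("active_fds must be between 0 and total_fds")
    return active_fds / total_fds


def faster_call(active_fds, total_fds, threshold=0.6):
    """Return 'epoll' below the threshold and 'poll' at or above it.

    Returning 'poll' exactly at the threshold is an assumption made
    for this sketch, not something the measurements state.
    """
    return "epoll" if atr(active_fds, total_fds) < threshold else "poll"
```

For example, `faster_call(30, 100)` picks epoll and `faster_call(80, 100)` picks poll.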

So, here are the numbers from one of the webservers that I instrumented to log the active-to-total ratio over a couple of hours.

The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high traffic website. Typically the requests are very short in nature (a 500 byte request followed by a < 10K answer).

After about two hours of running, the active-to-total ratio varied between 10% and 40% for 5 minute intervals, with the majority of the 5 minute buckets around the 30% mark. I'm actually quite surprised at the spread.

The bigger portion of the time seems to be spent waiting for the clients to send the request. Most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards; for longer running requests sending more data to the clients the active-to-total ratios would probably be a bit lower.

So 10% to 40% of all the sockets were active at any given time. The rest were idle, waiting for data to be received or for buffer space to be freed up so data could be written.
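
The logging code used for these numbers is not shown here. As a hypothetical sketch of the same idea, the function below takes (timestamp, active, total) samples from an event loop and averages the active-to-total ratio per 5 minute bucket; the sample format and names are assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5 minute buckets, matching the intervals described above


def bucket_atr(samples, bucket_seconds=BUCKET_SECONDS):
    """Average the active/total ratio per time bucket.

    samples: iterable of (timestamp_seconds, active_fds, total_fds).
    Returns {bucket_start_seconds: mean_ratio}. Samples taken while no
    sockets were open (total_fds == 0) are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, active, total in samples:
        if total == 0:
            continue
        start = int(ts // bucket_seconds) * bucket_seconds
        sums[start] += active / total
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}
```

Taking one sample per pass through the event loop, right after the poll or epoll call returns, would give the ratio the loop actually sees.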

In this situation epoll would be faster than poll because epoll only sends the user process those fds that it actually has to deal with rather than all of them, so the loop that takes the output of the system call will have fewer iterations.
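
A small experiment shows the "only the fds you have to deal with" behaviour. This is a sketch under assumptions: it uses Python's `selectors.DefaultSelector`, which is backed by epoll on Linux, and local socket pairs stand in for client connections.

```python
import selectors
import socket


def count_ready(total, active):
    """Open `total` socket pairs, make `active` of them readable, and
    return how many fds a single select() call reports as ready."""
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    pairs = [socket.socketpair() for _ in range(total)]
    try:
        for reader, _writer in pairs:
            sel.register(reader, selectors.EVENT_READ)
        for _reader, writer in pairs[:active]:
            writer.send(b"x")  # one byte makes the peer readable
        events = sel.select(timeout=0)
        return len(events)
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

With 50 registered sockets and 15 of them active, the call hands back 15 events. In C, a loop over the pollfd array passed to poll() has to check revents on all 50 entries instead.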

So, as I wrote before, I think the typical web server is, when it is dealing with the client facing side, more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.

This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.

Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.

If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.

Read "on dial-up" as "slow". The argument depends only on there being a certain distribution of client speeds. It's not about dial-up in particular.
And, if there's a distribution of speeds then you can measure the distribution and see what works best. Again, my challenge still stands:

Measure it or STFU.

You've already got a benchmarking system configured; why not benchmark using a congested (or at least, emulated) pipe?

It doesn't necessarily depend on dial-up, either. Imagine the number of people who leave bittorrent open in the background, stream porn, or whatever else that leaves their individual HTTP connections slow. Hell, latency alone (it takes at least a second for my connection to reach the east coast of the USA) would have an effect, and you can't underestimate the increasing number of mobile devices on slow(-ish, depending on congestion) 3G networks.

I'd provide statistics from my server (I serve an NZ gaming community), but I suspect my numbers would be disproportionate compared to the average workload. Here in NZ, we have far more people on crappy pipes (our DSL network is, famously, a gigantic pile of shit - although that has improved over the past couple of years and continues to), and far fewer people on smartphones (iPhones cost ~$800USD here).

Still, I believe the commenter has a point which you shouldn't ignore, or at least shouldn't pass off so easily :). I'd love to do some testing myself, but unfortunately between working a day-job, and spending my evenings trying to get a startup off the ground, I've got no time spare.

How about you get measurements of ATR from real-world deployments instead of the wild conjectures you've laid in this thread? Your challenge applies even more so to yourself:

Measure it or GTFO.

Oh, you mean do what I'm already doing? Measuring and developing ideas then testing them?

It helps if you're going to comment that you actually read the words I use, not the ones you have in your head that make you sound like you're super smart.

Yes, you measured the ideal ATR inflection point for poll vs. epoll in your synthetic microbenchmark.

But you guessed wildly about what ATRs people see in the real world: http://news.ycombinator.com/item?id=1572292 http://news.ycombinator.com/item?id=1572418

If only you practiced what you preached. Imagine the amount of self-righteous bile that your servers wouldn't serve.
I am practicing what I preach, you stupid 20%-er troll. That's why the whole blog post is full of measurements, testing hypotheses, and the assumption that I might be wrong.

Because unlike you, I actually go do shit rather than spout off in a comment thread.

You may think you are. Far out. Why you think you can violate basic social norms, while others apparently support it, is beyond me.
> Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.

I agree, but "high enough" is apparently just 60%. The standing question is, what's the actual level in different kinds of servers?

In other words, I've given a metric: ATR at 60% is the break even point for poll vs. epoll. So far the only responses I've got haven't even tried to give out a metric, let alone say what their actual ATR is, but they claim that it's low.

I'm a scientist, so in the same way I don't believe my own research, I don't believe their rhetoric.

> I'm a scientist

I've never come across a scientist that took criticism of their work the way you do and that responded in the way you do. Shouting down, deriding, insulting and in general being a jerk to those that don't agree with you because 'you're a scientist' is not the way of science.

I dunno, that is how a lot of scientists I know interact with people. :/
That's pretty sad.
I'm shouting you down because you're a FUD slinging troll. Very first thing you did was immediately reply to every branch of the comments with your agenda. I actually have no idea what your problem is, since I'm just presenting some information and working on my own software with it, but you've got some weird "epoll religion" you like to spread.

So, consider me the Richard Dawkins of epoll.

Zed, you might reconsider your social strategy. Your article lays out a brilliant theoretical approach, but it's hard to figure out how to apply it without more real-world data. Pragmatically, it seems like your goal should be to encourage others who are in a position to gather this data to do so, and to share it with you.

It looks like Jacques has a pretty good start at making these measurements: http://news.ycombinator.com/item?id=1573145. If the numbers he provides aren't helpful, or aren't complete, you might try encouraging him to fix them. Calling him a "FUD slinging troll" seems more likely to cause him to tune out and ignore you, to the detriment of us all.

Realize that you've been thinking about this problem for a while now, while others have just started their thought process. Your goal is to get them up to speed so they can move your argument further, but this won't be instantaneous. Treating them as potential allies during this formative stage might pay off. If you can hold off with the insults for a couple hours or even days, you might get better results than if you shout them down immediately. :)

  Very first thing you did was immediately reply to every branch of the
  comments with your agenda.
You're suffering from paranoia. If you post an article about security, you can bet tptacek is all over the comments, informing and correcting people. In this case, the article was about something jacquesm happens to know a bit about, so he participates actively in the comments. To suggest he is pushing an 'agenda' is ridiculous: there's nothing at stake for him. The only thing he tries to do is help you, by noting that he thinks you have overlooked something.

  I actually have no idea what your problem is [..]
That's because he doesn't have a problem: it's your mind that's filling in the blanks. It suggests that while writing the article, you were already sure people would challenge you based on 'religious conviction' instead of on fact. jacquesm's point was a simple, critical question: what are actual real-world ATRs for the servers that Mongrel2 should be able to replace?

Allow me to make an observation of a psychological nature: you are thoroughly miffed that it was so easy for someone to provide possibly devastating criticism of an idea about which you started caring WAY too much. What you should realize is that nobody thinks less of you because of that criticism: the article is still interesting and provides a sound basis. There is no reason to react in such an aggressive way; it's even counterproductive.

Yeah, but there's a fetishization of "high concurrency" (being able to support a huge number of connections) rather than absolute performance.

For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.

End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.
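
The three scenarios above are the same workload seen at different latencies, which Little's law makes explicit. A minimal sketch, assuming steady state and one in-flight request per connection (the function name is hypothetical):

```python
def throughput(concurrent_connections, latency_seconds):
    """Requests per second implied by Little's law: L = lambda * W,
    so lambda = L / W (steady state, one request per connection)."""
    return concurrent_connections / latency_seconds


scenarios = {
    "baseline": (10_000, 1.0),
    "high concurrency": (100_000, 10.0),
    "low latency": (1_000, 0.1),
}
rates = {name: throughput(conns, lat) for name, (conns, lat) in scenarios.items()}
```

All three entries in `rates` work out to about 10,000 requests per second, so the connection count by itself says nothing about how much work the server is doing.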

Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.

If you could increase the number of connections to 100,000 you would indeed be a genius because when you bind to a network interface using IPV4 there is a hard limit of the short integer used to indicate the port number which automatically limits you to 65536 connections (actually a few less, usually you'll lose 3 for stdin,stdout and stderr (which you can close to reuse them) and one for the listen socket).

As far as I know the only way around this is to use multiple IPs (possibly aliases on the same interface) but that would still require a new process.

So even if your per-process limit for fds can be larger than 64K the network layer or the mapper that turns fds in to socket ids for the network stack to work with may impose a restriction. I don't know enough about the linux kernel to figure out what exactly causes this.

I use the 64K limit on some high throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that but the expense of another process is fairly small so I haven't put in much time to see if I can work around that. Socket to fd mapping presumably takes into account the address as well as the port so it shouldn't be a problem, but on the kernel of the machines where I have to resort to these tricks it appears to be a limit.

Maybe someone with more knowledge of the guts of the linux kernel can point out why this happens.

TCP connections are identified by the (src ip, src port, dest ip, dest port) tuple. The server only needs one port. So theoretically a server can handle 64k connections per client.
You can see this in the 1M connection test done here: http://www.metabrew.com/article/a-million-user-comet-applica... Look at the "Turning it up to 1 Million" section where he details the need to use 17 IPs for the client side.
Yeah, and that's on the client side, as is indicated by the first sentence of that section:

Creating a million tcp connections from one host is non-trivial.

The key words being "from one host". With a single client machine connecting to a single server endpoint, the (src ip, src port, dest ip, dest port) is reduced to being unique only on src port (from the client's perspective), so that's where the 65k limit, and the need for more IPs to do that, comes from. Using multiple source IPs on the same machine is like using multiple client hosts.
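
The arithmetic behind needing several source IPs on one client host can be sketched as follows. The per-IP port count is a parameter because the usable ephemeral range depends on system configuration; the 60,000 figure in the note below is a hypothetical value, not one taken from the linked test.

```python
import math

THEORETICAL_PORTS = 65_535  # 16-bit port field, port 0 excluded


def min_source_ips(connections, usable_ports_per_ip=THEORETICAL_PORTS):
    """Fewest source IPs one client host needs to open `connections`
    sockets to a single (dest ip, dest port) endpoint."""
    if usable_ports_per_ip <= 0:
        raise ValueError("usable_ports_per_ip must be positive")
    return math.ceil(connections / usable_ports_per_ip)
```

With the full theoretical range, a million connections need at least 16 source IPs. With a hypothetical 60,000 usable ports per IP the answer is 17.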

...using IPV4 there is a hard limit of the short integer used to indicate the port number which automatically limits you to 65536 connections (actually a few less, usually you'll lose 3 for stdin,stdout and stderr (which you can close to reuse them) and one for the listen socket).

The file descriptor limit is independent of the 65k total possible source ports. The source port limit is part of TCP/UDP. The file descriptor limit is set by ulimit (nofile in limits.conf) on a per-process basis and in /proc for system-wide. If you need more file descriptors, you can reuse 0, 1 and 2, but that's not going to free up any ports so that a single process can make more connections to the same server endpoint.
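
The per-process limit can be checked from inside the process. A minimal sketch using Python's standard `resource` module on a Unix-like system (the wrapper name is hypothetical):

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors
    for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
print(f"open file descriptors: soft={soft} hard={hard}")
```

The soft value is the one a running process actually hits. It can be raised up to the hard value without extra privileges.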

Now that is a test. Thanks for posting that, it is the most interesting thing I've seen all day.
But, a server can have multiple IPs, so a server should be able to handle more than 64K connections from multiple clients without a problem. In practice there appears to be some kind of limit.
The server doesn't need multiple IPs to handle > 65535 connections. All the server connections to a given IP are to the same port. For a given client, the unique key for an http connection is (client-ip, PORT, server-ip, 80). The only number that can vary is PORT, and that's a value on the client. So, the client is limited to 65535 connections to the server. But, a second client could also have another 65K connections to the same server-ip:port.
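
This is easy to check on loopback. The sketch below (hypothetical function name, small connection count) opens several client connections to one listening socket and records the port pair the server sees for each.

```python
import socket


def demo_connections(n_clients=5):
    """Accept n_clients loopback connections on one listening port and
    return the (client_port, server_port) pair seen for each."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n_clients)
    server_addr = server.getsockname()

    clients, accepted, pairs = [], [], []
    try:
        for _ in range(n_clients):
            c = socket.create_connection(server_addr)
            clients.append(c)
            conn, peer = server.accept()
            accepted.append(conn)
            # peer is the client's (ip, port); conn's local side is the server's
            pairs.append((peer[1], conn.getsockname()[1]))
        return pairs
    finally:
        for s in clients + accepted + [server]:
            s.close()
```

Every pair has a different client port and the same server port, so accepting connections does not consume server-side ports.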

edit: You may be limited by number of open sockets or file handles. It's likely a per-process limit. Google or some linux guru could help you track down what limit it actually is, but it's not the number of server ports available. It might be a number you could raise.

Right, that makes perfect sense. But it really makes me wonder why I run in to that hard limit, I've tried just about everything to get around it and no matter what I do that seems to be the magic number.

I should go and do some testing to see what's causing this, you make me feel like the solution is right around the corner.

re. your edit, ulimit will happily raise the number > 64K, all the /proc/* settings seem to be ok so that's not it, it has to be some other layer in the stack that causes this. I'll definitely spend some time on this, it's been bugging me for a long time.

edit2: there seems to be a max_user_watches upper limit to what epoll will handle.
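
On Linux this limit is exposed under /proc, so it can be read directly. A minimal sketch (the function name is hypothetical, and it returns None where the file does not exist):

```python
from pathlib import Path

MAX_USER_WATCHES = Path("/proc/sys/fs/epoll/max_user_watches")


def epoll_max_user_watches(path=MAX_USER_WATCHES):
    """Return the per-user epoll watch limit, or None if the file is
    missing or unreadable (for example on a non-Linux system)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Comparing this value with the number of fds registered across all epoll instances owned by the same user would show whether this is the limit being hit.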

For simple testing purposes, it is easy to set up a forwarding proxy that drops n% of the packets it receives - for some high value of n. The World Wide Web is far more sadistic, but it still uncovers some performance or usability problems that are invisible over normal `localhost` traffic. I bet you can also use web servers with traffic shaping to mimic lots of slow connections at once, but I haven't tried that.
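
A lighter-weight stand-in for the lossy proxy is a client that trickles its request out a few bytes at a time. This is a different technique from dropping packets, but it keeps a connection open and mostly idle in a similar way. The function name, chunk size and delay below are assumptions.

```python
import socket
import time


def send_slowly(sock, data, chunk_size=1, delay=0.5):
    """Send `data` in small chunks with a pause between them, imitating a
    client on a slow or congested line. Returns the number of chunks sent."""
    chunks = 0
    for i in range(0, len(data), chunk_size):
        sock.sendall(data[i:i + chunk_size])
        chunks += 1
        if i + chunk_size < len(data):
            time.sleep(delay)  # no pause needed after the final chunk
    return chunks
```

Running many of these against a test server, each on its own connection, gives a set of sockets that are open but rarely active.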
Mongrel2 is supposed to handle WebSockets as well as HTTP, so I think open connections with sporadic traffic are a use case Zed has to worry about.