Hacker News new | ask | show | jobs
by pmjordan 5802 days ago
Pardon my ignorance, I haven't built high performance servers at this low a level, but I'm intrigued:

What exactly is the definition of an "active" file descriptor in this context?

My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.

EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.

If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.

3 comments

An active file descriptor is one that you can read from or write to without blocking or getting EAGAIN as error. The whole point of poll/epoll/kqueue/select is to figure out which file descriptors are in such a state.

The difference between poll and epoll is that, given an input of N file descriptors, poll returns all N file descriptors and you need to loop through each one of them to check whether the 'active' flag is set on there. epoll just returns all the active file descriptors so that you don't need to loop through the inactive ones.

A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

I totally absolutely agree they should fix epoll, but the way they've designed I don't see it happening. Of course they could fix the call for doing the actual select and make it at least as fast as poll, but the fact that you have to do a syscall for every file descriptor is idiotic.
Some people actually did fix epoll, benchmarked the results and concluded it wasn't worth it.

http://www.linuxinsight.com/ols2004_comparing_and_evaluating...

Thanks for the detailed explanation. Sounds like I was at least on the right track.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.

That does indeed sound like a better conclusion.

Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.

Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.

I highly doubt that in production it will make any difference at all. In the end it is not the poll/epoll overhead that determines your overall throughput. If you call poll/epoll more frequently than you should then it starts to add up but in reality one call per several thousand file system operations doing real IO is not going to make a big difference.

Of course all the little bits help and I'm happy to see someone pay attention to detail like this but normally speaking you should get to the point where you're shifting data in real life situations and you can hook up a profiler to make the decision. You have less to blog about like that but the difference between poll and epoll is not large enough that you would spend more time going from the one to the other than was spent analysing this and writing the post.

Optimisations like this are best left to when you have things working, first make it work, then make it fast.

I keep seeing this...misconception? I don't know what to call it, over most comments. The article doesn't talk about optimizations. It talks about design decisions. Someone actually took the time to sit down, ponder what kind of workloads will be handled by his application, came across an interesting dilemma, measured and tested them both, and finally reached an educated conclusion.

That right there is how you correctly choose how to implement things, contrary to popular belief. Think first, write code later.

  Someone actually took the time to sit down, ponder what kind of
  workloads will be handled by his application
The most important point made in this thread is that Zed actually didn't do that, but did benchmarks for a range of workloads, suggesting he intends Mongrel2 to be able to handle all of them. The question is whether that is necessary. If ATR > 0.6 never happens in practice, it will only unnecessarily complicate Mongrel2.
Writing the "which fds are ready" check twice would be crazy, if neither version were markedly more efficient than the other. Optimization is the only reason to make this particular design decision, and measurement is definitely the right way to go about it.
what good thinking gives you when your assumptions are wrong?
You again? Man you got some beef.

Anyway, we do have something working. You can deploy Mongrel2 right now, and I do all the time. You can actually measure how fast it is, although that's going to be slow since we haven't done much tuning so kind of pointless.

So, again you're wrong. We are at a point in the design where this measurement and analysis matters, and we have something that actually works to put it in. You should probably maybe go do some actual reading instead of posting here like you know what you're talking about.

Absolutely, although I can see how it makes sense to look into this stuff for Mongrel2 which seems to do little processing of its own, let alone disk I/O and mostly acts as a kind of demultiplexer.
It's actually a really simple concept what's "active" in poll vs. epoll. Your call to poll and epoll basically looks like this:

active_fds = poll(big_ass_array_of_fds, total_fds)

epoll is slightly different but same concept. You have a total number of FDs you're want to know about, and each call returns a number that have had activity.

And that's it. You then just do active_fds/total_fds and that gives the ATR. If this is < 0.6 after your call to poll, then that call to poll would have been better done with epoll. If the active_fd/total_fd is > 0.6 then it's better to stick with poll.

Of course, it's more complicated than that, but this gives you a simple metric of the break point where one is better than another.

An active file descriptor is a filedescriptor that you want to read from that has data available and one that you want to write to that has buffer space available.

To put it in another way, if you were to use blocking IO then an operation on an active descriptor would not block. Of course poll and epoll are all about asynchronous IO (so non-blocking by definition) but that's a good way to describe the difference.

Zed's 'superpoll' is precisely what you suggest.

Thanks for the explanation, I didn't think of the part about the socket being free for writing vs. whether there was data available.

Zed's 'superpoll' is precisely what you suggest.

Facepalm. Thanks, I mysteriously missed that part of the article.