Hacker News new | ask | show | jobs
by FooBarWidget 5797 days ago
An active file descriptor is one that you can read from or write to without blocking or getting EAGAIN as error. The whole point of poll/epoll/kqueue/select is to figure out which file descriptors are in such a state.

The difference between poll and epoll is that, given an input of N file descriptors, poll returns all N file descriptors and you need to loop through each one of them to check whether the 'active' flag is set on there. epoll just returns all the active file descriptors so that you don't need to loop through the inactive ones.

A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

2 comments

I totally absolutely agree they should fix epoll, but the way they've designed I don't see it happening. Of course they could fix the call for doing the actual select and make it at least as fast as poll, but the fact that you have to do a syscall for every file descriptor is idiotic.
Some people actually did fix epoll, benchmarked the results and concluded it wasn't worth it.

http://www.linuxinsight.com/ols2004_comparing_and_evaluating...

Thanks for the detailed explanation. Sounds like I was at least on the right track.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.

That does indeed sound like a better conclusion.

Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.

Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.

I highly doubt that in production it will make any difference at all. In the end it is not the poll/epoll overhead that determines your overall throughput. If you call poll/epoll more frequently than you should then it starts to add up but in reality one call per several thousand file system operations doing real IO is not going to make a big difference.

Of course all the little bits help and I'm happy to see someone pay attention to detail like this but normally speaking you should get to the point where you're shifting data in real life situations and you can hook up a profiler to make the decision. You have less to blog about like that but the difference between poll and epoll is not large enough that you would spend more time going from the one to the other than was spent analysing this and writing the post.

Optimisations like this are best left to when you have things working, first make it work, then make it fast.

I keep seeing this...misconception? I don't know what to call it, over most comments. The article doesn't talk about optimizations. It talks about design decisions. Someone actually took the time to sit down, ponder what kind of workloads will be handled by his application, came across an interesting dilemma, measured and tested them both, and finally reached an educated conclusion.

That right there is how you correctly choose how to implement things, contrary to popular belief. Think first, write code later.

  Someone actually took the time to sit down, ponder what kind of
  workloads will be handled by his application
The most important point made in this thread is that Zed actually didn't do that, but did benchmarks for a range of workloads, suggesting he intends Mongrel2 to be able to handle all of them. The question is whether that is necessary. If ATR > 0.6 never happens in practice, it will only unnecessarily complicate Mongrel2.
Writing the "which fds are ready" check twice would be crazy, if neither version were markedly more efficient than the other. Optimization is the only reason to make this particular design decision, and measurement is definitely the right way to go about it.
what good thinking gives you when your assumptions are wrong?
You again? Man you got some beef.

Anyway, we do have something working. You can deploy Mongrel2 right now, and I do all the time. You can actually measure how fast it is, although that's going to be slow since we haven't done much tuning so kind of pointless.

So, again you're wrong. We are at a point in the design where this measurement and analysis matters, and we have something that actually works to put it in. You should probably maybe go do some actual reading instead of posting here like you know what you're talking about.

Absolutely, although I can see how it makes sense to look into this stuff for Mongrel2 which seems to do little processing of its own, let alone disk I/O and mostly acts as a kind of demultiplexer.