| HN Mirror


by mewse 896 days ago

Worth noting that there's an equivalent of epoll on most platforms.

On Windows there's IOCP, on Mac and the BSD derivatives there's kqueue, and Linux has epoll. To a first approximation they all do the same thing: instead of giving you a full-sized array of "active or not" results that you have to iterate across to detect activity on each of your sockets (as you get from the standard Berkeley sockets 'select' and 'poll' APIs), they only inform you about the sockets that are actually active, so you spend less CPU time iterating through that big results array.
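A minimal sketch of the "only the active sockets come back" behaviour, assuming Python's standard `selectors` module as a stand-in for the native APIs (it picks epoll on Linux and kqueue on the BSDs and macOS). The helper name `count_ready` and the socket counts are made up for illustration. Python's wrappers always return just the ready entries, so this shows the shape of the result rather than the raw array scan that C code does with select/poll.

```python
import selectors
import socket


def count_ready(total=50, active=3):
    """Watch `total` sockets, make `active` of them readable, return the event count."""
    sel = selectors.DefaultSelector()  # epoll, kqueue, poll or select, whichever exists
    pairs = [socket.socketpair() for _ in range(total)]
    try:
        for reader, _writer in pairs:
            sel.register(reader, selectors.EVENT_READ)
        for _reader, writer in pairs[:active]:
            writer.send(b"x")  # only these sockets have data waiting
        # The result lists the ready sockets only, not one slot per watched socket.
        return len(sel.select(timeout=1))
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


if __name__ == "__main__":
    print(count_ready(), "events for 50 watched sockets")
```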

I can say from personal experience that when you've got a single server process that's monitoring 25,000 sockets on a Pentium Pro 200 MHz box, it makes a huge difference!

I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?

7 comments

jpgvm 895 days ago

They aren't really created equally though.

epoll and kqueue really are just edge-triggered select/poll.

However, IOCP and the new io_uring are different beasts: they are completion-based APIs rather than readiness-based ones.

To quickly explain the difference:

readiness based: tell me all sockets that are ready to be read from

completion based: do this, tell me when you are done

The "tell me when you are done" part is usually handled in the form of a message on a queue (or ring buffer, hence the name io_uring, with the u being for userspace). That generally also means really high scalability, both for submitting tons of tasks and for processing tons of completions.
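The two styles can be contrasted with a toy sketch. This is an emulation only: the completion side uses a background thread and a `queue.Queue` to stand in for the kernel and its completion queue, and does not call IOCP or io_uring. The names `readiness_read`, `submit_read` and `demo` are hypothetical.

```python
import queue
import select
import socket
import threading


def readiness_read(sock):
    """Readiness style: ask when the socket is readable, then do the read ourselves."""
    select.select([sock], [], [], 1.0)
    return sock.recv(4096)


def submit_read(sock, buf, completions, tag):
    """Completion style (emulated): hand over a buffer, get a queue message once it is filled."""
    def worker():
        n = sock.recv_into(buf)    # stands in for the kernel filling our buffer
        completions.put((tag, n))  # the "tell me when you are done" message
    threading.Thread(target=worker, daemon=True).start()


def demo():
    a, b = socket.socketpair()
    try:
        b.send(b"ping")
        first = readiness_read(a)

        completions = queue.Queue()
        buf = bytearray(16)        # must stay alive and untouched until completion
        submit_read(a, buf, completions, tag="read-1")
        b.send(b"pong")
        tag, n = completions.get(timeout=1)
        return first, tag, bytes(buf[:n])
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

Note that in the completion style the caller owns `buf` for the whole operation but should not touch it until the completion message arrives.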

Completion based APIs are superior IMO and it was always sad to me that Windows had one and Linux didn't so it's awesome Jens Axboe got his hands dirty to implement it. It beats the pants off of libaio, eventfd, epoll and piles of hacks.

o11c 895 days ago

A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.

Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
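A small sketch of the difference, assuming Linux (Python's `select.epoll` only exists there). The helper `wakeups` is made up for illustration: it leaves one byte unread and counts how many consecutive polls report it.

```python
import select
import socket


def wakeups(flags, polls=3):
    """Count how many of `polls` back-to-back epoll polls report one unread byte."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(a.fileno(), flags)
        b.send(b"x")  # data arrives once and is never read
        return sum(1 for _ in range(polls) if ep.poll(0))
    finally:
        ep.close()
        a.close()
        b.close()


if __name__ == "__main__":
    print("level-triggered:", wakeups(select.EPOLLIN))
    print("edge-triggered: ", wakeups(select.EPOLLIN | select.EPOLLET))
```

Level-triggered reports the socket on every poll while the byte sits unread (3 of 3 here); edge-triggered reports it once, on the transition to readable (1 of 3). That single notification is why an edge-triggered application that forgets to drain the socket simply stalls.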

Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.

AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.

(aside, `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation)
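For the "just a few file descriptors" case, a `poll` loop can be as short as the sketch below. It assumes a Unix-like system (Python's `select.poll` is not available on Windows), and `wait_readable` is a hypothetical helper name.

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets from `socks` that poll() reports as readable."""
    by_fd = {s.fileno(): s for s in socks}
    poller = select.poll()
    for fd in by_fd:
        poller.register(fd, select.POLLIN)
    return [by_fd[fd] for fd, event in poller.poll(timeout_ms) if event & select.POLLIN]


if __name__ == "__main__":
    a, b = socket.socketpair()
    c, d = socket.socketpair()
    b.send(b"hi")                       # only `a` has data waiting
    print(wait_readable([a, c]) == [a])
```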

manwe150 895 days ago

Edge-triggered requires an extra read/write on every epoll wakeup relative to level-triggered, though, because you must keep going until you actually hit the error state (EAGAIN), so it can actually be much slower (libuv considered switching at one point, but it wasn't clear the extra syscalls required by edge triggering were worthwhile).
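The extra syscall is easy to see in a sketch of the drain loop that edge-triggered code needs. This assumes a non-blocking stream socket; `drain` is an illustrative helper, not taken from libuv or any other library. Python surfaces EAGAIN as `BlockingIOError`.

```python
import socket


def drain(sock, chunk=4096):
    """Read a non-blocking socket until EAGAIN; return (data, number of recv calls)."""
    data = bytearray()
    calls = 0
    while True:
        calls += 1
        try:
            part = sock.recv(chunk)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: nothing left for now
            return bytes(data), calls
        if not part:             # empty read: the peer closed the connection
            return bytes(data), calls
        data += part


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)
    b.send(b"hello")
    print(drain(a))
```

Five bytes that fit in a single `recv` still cost two calls: one that returns the data and one whose only job is to come back with EAGAIN.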

o11c 895 days ago

Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
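A rough sketch of the write side, assuming Linux and a non-blocking socket that is already registered with an epoll object. `write_until_full` and `flush` are hypothetical helpers; a real event loop would also remember the current interest set and skip the `modify` call when nothing changed.

```python
import select
import socket


def write_until_full(sock, data):
    """Send from a non-blocking socket until the kernel buffer is full; return bytes accepted."""
    sent = 0
    while sent < len(data):
        try:
            sent += sock.send(data[sent:])
        except BlockingIOError:  # EAGAIN: the kernel send buffer is full
            break
    return sent


def flush(ep, sock, pending):
    """Write what fits, then watch for EPOLLOUT only while data is left over."""
    pending = pending[write_until_full(sock, pending):]
    interest = select.EPOLLIN | (select.EPOLLOUT if pending else 0)
    ep.modify(sock.fileno(), interest)  # the EPOLL_CTL_MOD mentioned above
    return pending


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)
    ep = select.epoll()
    ep.register(a.fileno(), select.EPOLLIN)
    leftover = flush(ep, a, b"x" * (8 * 1024 * 1024))
    print(len(leftover), "bytes still pending")
```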

Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?

manwe150 895 days ago

True, I don’t remember the details for writes, and the complexity of managing high/low water marks makes it even trickier to write optimal code. Large kernel send buffers mostly avoid the performance problem here anyway. But on a short write, I am not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.

Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events

marssaxman 895 days ago

> Completion based APIs are superior IMO

Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.

geertj 895 days ago

> Completion based APIs are superior IMO

I probably agree with that, but curious to know what your reasons are.

kqr 896 days ago

> to a first approximation they all do basically the same thing; instead of giving you an array you have to iterate across they inform you about the sockets that are actually active

Well, that's half the story. The other half is that select/poll is stateless, meaning the application–kernel bridge is flooded with data about which events you are interested in, despite the fact that this set usually doesn't change much between calls.

kqueue and the like are stateful instead: you tell the kernel which events you are interested in and then it remembers that.

abhishekjha 896 days ago

Wait, so which is better? Stateful or stateless? How do you decide?

Very new to these APIs and their usages.

kqr 896 days ago

Stateless is simpler to implement and easier to scale across multiple nodes, but comes with additional overhead. When polling the kernel for sockets, the overhead was a bigger cost than implementation complexity and horizontal scaling. (Implementation complexity is still a problem – see the discussions regarding the quality of epoll; and horizontal scaling is just not something desktop kernels do.)

mewse 896 days ago

Good point, yes!

bluetomcat 896 days ago

It's a subscription-based kernel API. Instead of passing roughly the same set of descriptors in each successive iteration of the event loop, you tell the kernel once that you are interested in a given fd. It can then append to the "ready list" as each of the descriptors becomes available, even when you're not waiting on epoll_wait.

With the older select() and poll() interfaces, you have to pass the whole set every time you wait. The kernel has to construct individual wait queues for each fd. Epoll is just a more convenient and efficient API, allowing an efficient implementation on the side of the kernel.
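The difference in calling convention can be sketched as follows, assuming Linux for the epoll half. Both function names are made up for illustration. With `select` the whole interest list is an argument of every wait; with epoll it is registered once and later waits pass no descriptors at all.

```python
import select
import socket


def stateless_wait(socks, rounds=3):
    """select(): the full interest list is handed to the kernel on every call."""
    counts = []
    for _ in range(rounds):
        readable, _, _ = select.select(socks, [], [], 0)
        counts.append(len(readable))
    return counts


def stateful_wait(socks, rounds=3):
    """epoll: subscribe once, then each wait only asks what is ready."""
    ep = select.epoll()
    try:
        for s in socks:
            ep.register(s.fileno(), select.EPOLLIN)  # the kernel remembers this
        return [len(ep.poll(0)) for _ in range(rounds)]
    finally:
        ep.close()


if __name__ == "__main__":
    pairs = [socket.socketpair() for _ in range(10)]
    readers = [r for r, _ in pairs]
    pairs[0][1].send(b"x")
    print(stateless_wait(readers), stateful_wait(readers))
```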

liendolucas 895 days ago

I tried to use IOCP with Python and the pywin32 module a few years ago. I was never able to make it work, even with C++ code as a guiding source. Documentation and resources for IOCP were also very scarce when I looked, and it almost seemed like an obscure topic to dive into, so I finally gave up. On the other hand, kqueue and epoll are almost trivial to use. If everything fails there is always select, which is easy to use as well.

docandrew 895 days ago

Making IOCP work underneath stateful protocols like TLS is more complicated still, and then add on multithreaded handlers and things get messy real quick. It’s a similar story with io_uring. It can be done (I’ve done it) but it's not easy to reason about (at least for me, maybe I just don’t have the mental horsepower or proper brain wiring to grasp it easily).

throw0101d 896 days ago

> Worth noting that there's an equivalent of epoll on most platforms.

And libevent if you want a portable front-end to all of them:

* https://libevent.org

* https://en.wikipedia.org/wiki/Libevent

jen20 896 days ago

Or libuv:

* https://libuv.org/

another2another 895 days ago

Or even libdispatch, which has been working surprisingly well for me on Linux and Mac OS X.

graemep 896 days ago

> Worth noting that there's an equivalent of epoll on most platforms.

The article makes that clear.

deaddodo 896 days ago

What a weird thing to get offended by. It mentions kqueue, sure. OP is just noting that it's a near ubiquitous thing nowadays.

Cool your wits.

chrisrhoden 896 days ago

From the article:

> Aside: All of the above work on many operating systems and support API’s other than epoll, which is Linux specific. The Internet is mostly made of Linux, so epoll is the API that matters.

ghusbands 895 days ago

That quote is talking about epoll-using software (Go, nginx and 'most programming languages', including Rust), not polling-based APIs. You should instead quote:

> without BSD’s kqueue (which preceded epoll by two years), we’d really be in trouble because the only alternatives were proprietary (/dev/poll in Solaris 8 and I/O Completion Ports in Windows NT 3.5).

HackerThemAll 896 days ago

> I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?

Are you aware that Linux is used e.g. by Google, Microsoft or Amazon? Can there be occasions when they handle more than 5 simultaneous connections per box? Maybe epoll can make a difference for them, no?

jen20 896 days ago

For every Google there are 50 enterprise shops using a thread per connection in IIS or Tomcat…