Hacker News
by geocar 3668 days ago
The cool thing is that by gifting buffers to the kernel as policy, the kernel gets to make that choice.

In Linux, you don't, and userspace and kernelspace both end up doing unnecessary copying, and it means with shared buffers, something might poll and not-block, but actually block by the time you get around to using the buffer.

This is annoying, and it generally means you need more than two system calls on average for every IO operation in performance servers.

As a rule, you can generally detect "design faults" by the number of competing and overlapping designs (select, poll, epoll, kevent, /dev/poll, aio, sigio, etc, etc, etc). I personally would have preferred a more fleshed out SIGIO model, but we got what we got...

One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small. In kqueue you can store both the file descriptor with activity (ident) and a few bytes of user data. With epoll there is only room for one or the other, so people store a heap pointer, which causes an extra stall to memory. Memory is so slow...
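To make the size complaint concrete, here is a minimal sketch for Linux. It only checks how big epoll_data_t is and shows one possible workaround, packing the descriptor and a small tag into the u64 member. The function names (epoll_payload_size, pack_fd_and_tag, unpack_fd, unpack_tag) are made up for illustration and are not from the comment above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* epoll_data_t is a union: ptr, fd, u32 and u64 share the same storage,
 * so one event carries a pointer OR a descriptor, never both. */
static_assert(sizeof(epoll_data_t) == sizeof(uint64_t),
              "epoll_data_t is expected to be 8 bytes");

size_t epoll_payload_size(void)
{
    return sizeof(epoll_data_t);
}

/* Hypothetical workaround: fd in the low 32 bits, a small user tag in
 * the high 32 bits of the u64 member. Assumes fd is non-negative. */
uint64_t pack_fd_and_tag(int fd, uint32_t tag)
{
    return ((uint64_t)tag << 32) | (uint32_t)fd;
}

int unpack_fd(uint64_t v)
{
    return (int)(uint32_t)v;
}

uint32_t unpack_tag(uint64_t v)
{
    return (uint32_t)(v >> 32);
}
```

For comparison, kqueue's struct kevent has separate ident and udata fields, so no packing is needed there.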


> something might poll and not-block, but actually block by the time you get around to using the buffer

On Linux if a socket is set to non-blocking it will not block. I don't really understand your point with shared buffers. You wouldn't typically share a TCP socket since that would result in unpredictable splitting/joining of data.
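A small sketch of the non-blocking claim, assuming Linux and a UNIX-domain socketpair standing in for a TCP socket. The helper name nonblocking_read_returns_eagain is hypothetical. Reading from an empty socket with O_NONBLOCK set should fail with EAGAIN (or EWOULDBLOCK) rather than put the caller to sleep.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if reading an empty non-blocking socket fails with
 * EAGAIN/EWOULDBLOCK instead of blocking, 0 if it behaves otherwise,
 * and -1 if the setup itself fails. */
int nonblocking_read_returns_eagain(void)
{
    int sv[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Nothing has been written to sv[1], so there is no data to read. */
    ssize_t n = read(sv[0], buf, sizeof buf);
    int would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(sv[0]);
    close(sv[1]);
    return would_block;
}
```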

> In Linux, you don't, and userspace and kernelspace both end up doing unnecessary copying

I'm not so sure the Linux design where copies are done in syscalls must be inherently less efficient. I'm pretty sure with either design, you generally need at least one memcpy - for RX, from the in-kernel RX buffers to the user memory, and for TX, from user memory to the in-kernel TX buffers. I think getting rid of either copy is extremely hard and would need extremely smart hardware, especially the RX copy (because the Ethernet hardware would need to analyze the packet and figure out where the final destination of the data is!). Getting rid of TX copy might be easier but still hard because it'd need complex DMA support on the Ethernet card that could access potentially unaligned memory addresses. On the other hand, I also don't think you need more than one copy, if you design the network stack with that in mind.

> you need more than two system calls on average for every IO operation in performance servers.

True but it's not obvious that this is a performance bottleneck. Consider that a single epoll wait can return many ready sockets. I think theoretically it would hurt latency rather than throughput.
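A rough illustration of the batching point, assuming Linux and using pipes in place of sockets. The name count_ready_in_one_wait and the constant NPIPES are invented for this sketch. Eight descriptors are made readable, and a single epoll_wait call is asked to report all of them.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 8

/* Makes NPIPES pipes readable and returns how many ready descriptors a
 * single epoll_wait call reports, or -1 on error. */
int count_ready_in_one_wait(void)
{
    int fds[NPIPES][2];
    struct epoll_event events[NPIPES];
    int opened = 0;
    int ready = -1;

    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    for (; opened < NPIPES; opened++) {
        if (pipe(fds[opened]) < 0)
            goto out;
        struct epoll_event ev = { .events = EPOLLIN,
                                  .data.fd = fds[opened][0] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[opened][0], &ev) < 0 ||
            write(fds[opened][1], "x", 1) != 1) {
            opened++;               /* this pipe is open, close it below */
            goto out;
        }
    }

    /* One system call, timeout 0: every readable pipe is returned at once. */
    ready = epoll_wait(ep, events, NPIPES, 0);

out:
    for (int i = 0; i < opened; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(ep);
    return ready;
}
```

Each ready descriptor still needs its own read or write afterwards, so the wait is amortized but the per-operation call is not.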

> As a rule, you can generally detect "design faults" by the number of competing and overlapping designs.

On Linux, I think for sockets, there are only: blocking, select, poll, epoll. And the latter three are just different ways to do the same thing. On Windows, it's much more complicated - see this list of different methods to use sockets (my own answer): http://stackoverflow.com/questions/11830839/when-using-iocp-...

> One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small.

Pretty much universally when you're dealing with a socket, you have some nontrivial amount of data associated with it that you will need to access when it's ready for I/O, typically a struct which at least holds the fd number. Naturally you put a pointer to such a struct into the epoll_data_t. I don't see how one could do it more efficiently outside of very specialized cases.

> I'm not so sure the Linux design where copies are done in syscalls must be inherently less efficient.

Windows overlapped IO can map the user buffer directly to the network hardware, which means that in some situations there will be zero copies on outbound traffic.

> especially the RX copy [...] I also don't think you need more than one copy, if you design the network stack with that in mind.

When the interrupt occurs, the network driver is notified that the DMA hardware has written bytes into memory. On Windows, it can map those pages directly onto the virtual addresses where the user is expecting it. This is zero copies, and just involves updating the page tables.

This works because on Windows, the user space said when data comes in, fill this buffer, but on Linux the user space is still waiting on epoll/kevent/poll/select() -- it has only told the kernel what files it is interested in activity on, and hasn't yet told the kernel where to deposit the next chunk of data. That means the network driver has to copy that data onto some other place, or the DMA hardware will rewrite it on the next interrupt!

If you want to see what this looks like, I note that FreeBSD went to a lot of trouble to implement this trick using the UNIX file API[0]

> On Linux, I think for sockets, there are only: blocking, select, poll, epoll. And the latter three are just different ways to do the same thing.

Linux also supports SIGIO[1], and there are a number of aio[2] implementations for Linux.

epoll is not the same as poll: Copying data in and out of the kernel costs a lot, as can be seen by any comparison of the two, e.g. [3]
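For contrast with epoll, a minimal poll() sketch, again with pipes standing in for sockets. poll_scan_demo and NFDS are names made up for this example. The whole pollfd array is passed to the kernel on every call, and the caller then walks it to find the ready entry, which is the per-call cost being referred to.

```c
#include <poll.h>
#include <unistd.h>

#define NFDS 4

/* Makes only the last of NFDS pipes readable, calls poll() once, and
 * returns the index of the ready entry, or -1 on error. */
int poll_scan_demo(void)
{
    int p[NFDS][2];
    struct pollfd pfds[NFDS];
    int opened = 0;
    int found = -1;

    for (; opened < NFDS; opened++) {
        if (pipe(p[opened]) < 0)
            goto out;
        pfds[opened].fd = p[opened][0];
        pfds[opened].events = POLLIN;
        pfds[opened].revents = 0;
    }

    if (write(p[NFDS - 1][1], "x", 1) != 1)
        goto out;

    /* The full interest set crosses into the kernel on every call. */
    if (poll(pfds, NFDS, 0) == 1) {
        /* User space scans the whole array to find what is ready. */
        for (int i = 0; i < NFDS; i++)
            if (pfds[i].revents & POLLIN)
                found = i;
    }

out:
    for (int i = 0; i < opened; i++) {
        close(p[i][0]);
        close(p[i][1]);
    }
    return found;
}
```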

Also worth noting: Felix observes[4] SIGIO is as fast as epoll.

> I don't see how one could do it more efficiently

Dereferencing the pointer causes the CPU to stall right after the kernel has transferred control back into user space, while the memory hardware fetches the data at the pointer. This is a silly waste of time and of precious resources, considering the process is going to need the file descriptor and its user data in order to schedule the IO operation on the file descriptor.

In fact, on Linux I get more than a full percent improvement out of putting the file descriptor there, instead of the pointer, and using a static array of objects aligned for cache sharing.
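A sketch of what that layout might look like, not the actual code behind the measurement. It assumes 64-byte cache lines and a small fixed descriptor limit; struct conn, conns, MAX_FDS, CACHE_LINE and fd_indexed_lookup_demo are all hypothetical names. The descriptor goes into epoll_data_t and doubles as the index into a static, cache-line-aligned table.

```c
#include <assert.h>
#include <stdalign.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_FDS    1024
#define CACHE_LINE 64           /* assumption: 64-byte cache lines */

/* One slot per descriptor, padded so each slot owns a full cache line. */
struct conn {
    alignas(CACHE_LINE) unsigned long bytes_seen;
    int in_use;
};

static struct conn conns[MAX_FDS];

/* Registers one pipe with the fd itself stored in the event, then uses
 * that fd to index the static table. Returns bytes read, or -1. */
long fd_indexed_lookup_demo(void)
{
    int p[2];
    long result = -1;

    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;
    if (pipe(p) < 0) {
        close(ep);
        return -1;
    }

    if (p[0] < MAX_FDS) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        struct epoll_event out;
        char buf[64];

        conns[p[0]].in_use = 1;
        conns[p[0]].bytes_seen = 0;

        if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "hello", 5) == 5 &&
            epoll_wait(ep, &out, 1, 0) == 1) {
            int fd = out.data.fd;       /* fd comes straight from the event */
            assert(fd >= 0 && fd < MAX_FDS);
            struct conn *c = &conns[fd]; /* static table, no heap pointer */
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                c->bytes_seen += (unsigned long)n;
                result = (long)c->bytes_seen;
            }
        }
        conns[p[0]].in_use = 0;
    }

    close(p[0]);
    close(p[1]);
    close(ep);
    return result;
}
```

The trade-off is a table sized for the highest descriptor number, which is usually acceptable for a server that already caps its open files.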

For more on this subject, you should see "What Every Programmer Should Know About Memory"[5].

[0]: http://people.freebsd.org/~ken/zero_copy/

[1]: http://davmac.org/davpage/linux/async-io.html#sigio

[2]: http://lse.sourceforge.net/io/aio.html

[3]: http://lse.sourceforge.net/epoll/dph-smp.png

[4]: http://bulk.fefe.de/scalability/

[5]: https://www.akkadia.org/drepper/cpumemory.pdf

Thanks for the info. Yes I suppose zero-copy can be made to work but surely one needs to go through a LOT of trouble to make it work.

I'm curious about sending data for TCP though, don't you need to have the original data available anyway, in case it needs to be retransmitted? Do the overlapped TX operations (on Windows) complete only once the data has also been acked? Are you expected to do multiple overlapped operations concurrently to prevent bad performance due to waiting for ack of pending data?

Windows was designed for an era where context switches were around 80µsec (nowadays they're closer to 2µsec), so a lot of that hard work has already paid for itself.

> don't you need to have the original data available anyway, in case it needs to be retransmitted? Do the overlapped TX operations (on Windows) complete only once the data has also been acked?

I don't know. My knowledge of Windows is almost twenty years old at this point.

If I recall correctly, the TCP driver actually makes a copy when it computes the packet checksums (since you have the cost of reading the pages anyway), but I think this behaviour is for compatibility with Winsock, and it could have used a copy-on-write page, or given a zero page in other situations.