In traditional I/O, a hardware interrupt is triggered whenever data arrives at the hardware boundary, and the interrupt can be serviced by any core that is available to the scheduler. One can imagine how much overhead is involved in context switching away from whatever that core was doing, setting up the registers, moving the data and then relinquishing the core back to the OS. In this model, dedicated cores serve I/O through a memory-mapped, ring-buffer-like data structure sized to your application's needs. There is no allocation/deallocation overhead, no management beyond moving a pointer, and no context switching. If you can spare the cores, this can significantly improve performance.
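To make the "no management beyond moving a pointer" part concrete, here is a minimal sketch of a fixed-size single-producer/single-consumer ring. It is an illustration only, not the kernel's actual layout: the class name `SpscRing` and its methods are made up for this example, and a real implementation would live in shared memory and use atomic loads/stores with memory barriers instead of plain Python integers.

```python
class SpscRing:
    """Hypothetical single-producer/single-consumer ring of preallocated slots."""

    def __init__(self, capacity: int, slot_size: int) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._capacity = capacity
        self._mask = capacity - 1
        self._slot_size = slot_size
        # All slot memory is allocated once, up front, and reused forever.
        self._slots = [bytearray(slot_size) for _ in range(capacity)]
        self._lengths = [0] * capacity
        self._head = 0  # only the consumer advances this
        self._tail = 0  # only the producer advances this

    def push(self, data: bytes) -> bool:
        """Producer side: copy data into the next free slot, or report 'full'."""
        if len(data) > self._slot_size:
            raise ValueError("data does not fit in one slot")
        if self._tail - self._head == self._capacity:
            return False  # ring is full; the caller decides whether to spin or drop
        idx = self._tail & self._mask
        self._slots[idx][: len(data)] = data  # same-length slice write, no resize
        self._lengths[idx] = len(data)
        self._tail += 1  # publishing an entry is just moving the tail
        return True

    def pop(self):
        """Consumer side: return the oldest entry, or None if the ring is empty."""
        if self._head == self._tail:
            return None
        idx = self._head & self._mask
        # The copy is for demo convenience; a real consumer would process in place.
        out = bytes(self._slots[idx][: self._lengths[idx]])
        self._head += 1  # freeing a slot is just moving the head
        return out


ring = SpscRing(capacity=4, slot_size=16)
ring.push(b"packet-1")
ring.push(b"packet-2")
print(ring.pop(), ring.pop(), ring.pop())
```

The power-of-two capacity lets the index wrap with a bit mask, and because each side only ever writes its own counter, the two sides never need a lock.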
In one use case, I was able to quadruple performance on a 32-core Xeon by installing four 10 Gbps Ethernet cards and dedicating the first eight cores to I/O (two per interface). This is all about latency, but with proper care it also improves throughput.
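As a rough sketch of what that core split could look like from userspace: the helper names `plan_io_cores` and `pin_to_cores` are hypothetical, and the layout (interface N gets cores 2N and 2N+1) is just one reading of "the first eight cores, two per interface". `os.sched_setaffinity` is Linux-only, so the pinning helper is guarded and is not called in the demo.

```python
from __future__ import annotations

import os


def plan_io_cores(n_interfaces: int, cores_per_interface: int,
                  total_cores: int) -> dict[int, list[int]]:
    """Map each interface index to the block of low-numbered cores reserved for it."""
    needed = n_interfaces * cores_per_interface
    if needed >= total_cores:
        raise ValueError("no cores would be left for the application")
    return {
        nic: list(range(nic * cores_per_interface, (nic + 1) * cores_per_interface))
        for nic in range(n_interfaces)
    }


def pin_to_cores(cores: list[int]) -> bool:
    """Restrict the caller to the given cores; returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, set(cores))  # pid 0 means "the caller"
    return True


plan = plan_io_cores(n_interfaces=4, cores_per_interface=2, total_cores=32)
io_cores = {core for cores in plan.values() for core in cores}
app_cores = [core for core in range(32) if core not in io_cores]
print(plan)
print(app_cores[0], app_cores[-1])
```

Pinning only says where the I/O threads may run. Keeping everything else off those cores needs separate configuration, such as cpusets or the `isolcpus` boot parameter.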
Someone more familiar with kernel workings than I am should clarify, but my understanding is that I/O generally happens via a syscall, which requires the thread/process in question to switch between userspace and kernel space, and that switch can be very expensive. By enabling I/O polling in userspace, you get to avoid that context switching.
The motivating benefit is performance, but a side benefit the author mentioned on Twitter (https://twitter.com/axboe/status/1073320502532263936) is sidestepping Meltdown and similar vulnerabilities that come from having the kernel and the application in the same address space (even though they're separated by a privilege boundary). In a scheme like this, you can theoretically dedicate one core to the application and a separate one to the kernel, and minimize speculation, cache sharing, etc. between the two. The application and the kernel communicate through a shared portion of memory, so the kernel never has to run on the application's CPU.
This is questionably practical for a general-purpose machine, but for a server system used entirely as a hypervisor, web server, file server, or something similar, it might fit really well.
Depends on the use case; keep in mind that syscalls are slow, too. If you have an application that does significant computation on lots of data (think a scientific calculation/simulation), having another core on the same socket read ahead from disk to RAM might be much more efficient than pausing computation to read synchronously. Or if you're a file server that is just passing things back to the kernel's network layer, you might not even need to see the contents of RAM yourself.
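A minimal sketch of that read-ahead idea, using a plain background thread and a bounded queue rather than anything kernel-specific. The function name `read_ahead` and its parameters are made up for this example. In CPython, blocking file reads release the GIL, so the reader can overlap with the computation.

```python
import io
import queue
import threading


def read_ahead(stream, chunk_size: int, depth: int = 4):
    """Yield chunks from stream while a helper thread reads up to `depth` ahead."""
    chunks = queue.Queue(maxsize=depth)  # bounded, so the reader cannot run away

    def reader() -> None:
        while True:
            chunk = stream.read(chunk_size)
            chunks.put(chunk)  # blocks while the consumer is `depth` chunks behind
            if not chunk:      # an empty read means EOF; it doubles as the sentinel
                return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = chunks.get()
        if not chunk:
            return
        yield chunk


# Stand-in for a large file on disk, with a byte sum as the "computation".
data = bytes(range(256)) * 64
total = sum(sum(chunk) for chunk in read_ahead(io.BytesIO(data), chunk_size=4096))
print(total == sum(data))
```

This sketch skips error handling: an exception in the reader thread would leave the consumer waiting, so real code would need to forward failures through the queue.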
Performance, latency variability and sanity. It’s somewhat easier to write applications without having to make a syscall, which may have unknown latency (even if it’s a non-blocking poll).