by peter_d_sherman 2437 days ago
http://irenezhang.net/papers/demikernel-hotos19.pdf

Excerpt:

"Researchers have long predicted the demise of the operating system [21, 26, 41]. As datacenter servers increasingly incorporate I/O devices that let applications bypass the OS kernel (e.g., RDMA [12] and DPDK [15] network devices or SPDK storage devices), this prediction may finally come true. While kernel-bypass devices do eliminate the OS kernel from the I/O path, they do not handle the kernel’s most important job: offering higher-level abstractions. This paper argues for a new high-level, device-agnostic I/O abstraction for kernel-bypass devices. We propose the Demikernel, a new library OS architecture for kernel-bypass devices."

That's the WHY of the Demikernel...

1 comments

> While kernel-bypass devices do eliminate the OS kernel from the I/O path, they do not handle the kernel’s most important job: offering higher-level abstractions

Well, Solarflare/OpenOnload lets you bypass the kernel without changing any of your code, via LD_PRELOAD.
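To make the "no code changes" point concrete, here is a minimal sketch of an ordinary BSD-sockets UDP sender and receiver. Nothing in it refers to the NIC or to any bypass library. The function name `udp_roundtrip` is made up for illustration, and the loopback address is used only so the sketch runs anywhere; real bypass traffic would go to an address routed over the accelerated NIC.

```python
import socket


def udp_roundtrip(payload: bytes, host: str = "127.0.0.1") -> bytes:
    """Send one UDP datagram to a local receiver and return what arrived.

    Plain socket calls only: this is the kind of code an LD_PRELOAD shim
    is meant to intercept underneath, without the source changing.
    """
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind((host, 0))       # port 0: let the OS pick a free port
    rx.settimeout(2.0)       # fail fast instead of blocking forever
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        tx.sendto(payload, rx.getsockname())
        data, _addr = rx.recvfrom(2048)
        return data
    finally:
        tx.close()
        rx.close()


if __name__ == "__main__":
    print(udp_roundtrip(b"ping"))
```

Assuming Onload is installed, the same unmodified script would typically be started through the `onload` launcher (something like `onload python3 sender.py`, where `sender.py` is a hypothetical file name), which preloads the library so the socket calls get intercepted in user space.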

That can work, but if you want maximum performance, you need to use the ef_vi library in OpenOnload, and that takes a fair bit of custom code. Exalink has libexanic, Napatech has their thing. libexanic is surprisingly elegant, and you can do a lot of the work (e.g. extracting a timestamp) while the rest of a packet is still arriving. Netronome has an eBPF way to let you run packet-handling code right in the NIC, maybe even freeing up a core.

Solarflare has ruled the roost, but Xilinx bought them out, and the future of their NICs is cloudy. Mellanox used to be a big deal; now they are part of NVIDIA. Mellanox and Solarflare (like Napatech) have spent a great deal of effort to make kernel bypass work for clients running in VMs.

Yeah. Funnily enough, I wrote a library on top of ef_vi for UDP sending, with mixed results. It was just UDP/MC dispatch, and it felt a little like grappling with an underdeveloped library. Pretty fast, but I'm sure I hit some kind of memory barrier bug at one point.

The thing with Onload is its sheer simplicity, which makes it an easy sell. It's also handy to tweak socket options with env vars. They also have a direct TCP lib if you need some extra nanos. Templated sends too.

I've not heard of Netronome and have never used libexanic.

These approaches mean you have to invest more in external passive monitoring tools, as OS stats are of little help.