Hacker News
by dist1ll 1083 days ago
Many general-purpose OS abstractions start leaking when you're working on systems-like software.

You notice it when web servers do kernel bypass for zero-copy, low-latency networking, or when database engines throw away the kernel's page cache to implement their own file buffer.


Yes. I think mmap() is misunderstood as being an advanced tool for systems hackers, but it's actually the opposite: it's a tool to make application code simpler by leaving the systems stuff to the kernel.

With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program, but the benefit is that you don't have to think about these logistics.
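A minimal sketch of that idea, assuming a POSIX system: the file is mapped and then scanned like an ordinary array. `count_byte_mapped` is a hypothetical helper for illustration, not code from the thread. Note what is missing: there is no read() loop, no buffer size to pick, and no cache eviction logic. The kernel pages the data in on demand and can drop those clean pages under memory pressure.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count occurrences of `needle` in the file at `path`
 * by mapping it instead of read()-ing it into a buffer.
 * Returns -1 on error. */
long count_byte_mapped(const char *path, char needle)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i++)   /* page faults do the actual I/O */
        if (data[i] == needle)
            count++;

    munmap((void *)data, len);
    return count;
}
```

The application code is just a loop over memory; how much of the file is resident at any moment is the kernel's decision.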

But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.

Web servers doing kernel bypass for zero-copy networking? Do you have a specific example in mind? I'm curious.

The most common example is DPDK [1]. It's a framework for building bespoke networking stacks that are usable from userspace, without involving the kernel.

You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the Seastar framework [2]. Also, I recently stumbled upon a paper on efficient RPC networking in data centers [3].

If you want to learn more, P99 CONF has tons of speakers talking about interesting challenges in that space.

[1] https://www.dpdk.org/

[2] https://github.com/scylladb/seastar

[3] https://github.com/erpc-io/eRPC

Interesting. I hear a lot more about sendfile(), kTLS and general kernel space tricks than I do about DPDK and userspace networking, but maybe it's just me.

I do wonder which trend is going to win: bypass the kernel or embrace the kernel for everything?

The way I see it, latency decreases either way (as long as you don't have to switch back and forth between kernel and user space), but userspace seems better from a security standpoint.

Then again, everyone is doing eBPF, so probably the "embrace the kernel" approach is going to win. Who knows.

The people who use DPDK and the like are a lot quieter about it. The nature of kernel development means that people tend to hear about what you're doing, while DPDK and userspace networking tends to happen in more proprietary settings.

That said, I'm not sure many people write webservers in DPDK, since the kernel is pretty well suited to webservers (sendfile, etc.). Most applications that use kernel bypass are more specialized.

The downside, of course, is that each program owns one instance of the hardware. Applications don't share the network card. This isn't a general purpose solution.

That may be acceptable for your purposes, or it may not.

Probably the most common example is sendfile() for writing file contents out to a socket without reading them into userspace:

https://man7.org/linux/man-pages/man2/sendfile.2.html
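A minimal sketch of how a server might use it, assuming Linux (`<sys/sendfile.h>` is Linux-specific; FreeBSD and macOS have a sendfile() with a different signature). `send_whole_file` is a hypothetical helper for illustration, not code from nginx, Apache, or workerd. It loops because a single sendfile() call may transfer fewer bytes than requested.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the whole file at `path` to `out_fd`
 * (typically a connected TCP socket). The bytes move from the page cache
 * to out_fd inside the kernel and are never copied into a user-space
 * buffer. Returns the number of bytes sent, or -1 on error. */
long long send_whole_file(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;               /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            /* EAGAIN on a non-blocking socket also ends up here; a real
             * server would wait for writability (e.g. epoll) and resume. */
            close(in_fd);
            return -1;
        }
        if (n == 0)
            break;                  /* file shrank underneath us */
    }

    close(in_fd);
    return (long long)offset;
}
```

On Linux 2.6.33 and later `out_fd` may also be a regular file, which makes a helper like this easy to try without a network connection.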

Isn't that the opposite? That is, bypassing user space, not kernel space?

Oh, hmm, yeah, perhaps OP meant something more like using raw sockets to get packets directly into userspace without relying on the kernel to arrange them into streams?

I'm not very familiar with that though.

Yes, I knew about sendfile() but I wasn't aware of any web server using it (though I know Kafka uses it).

Then I found out Apache supports it via the EnableSendfile directive. Nice.

>This directive controls whether httpd may use the sendfile support from the kernel to transmit file contents to the client. By default, when the handling of a request requires no access to the data within a file -- for example, when delivering a static file -- Apache httpd uses sendfile to deliver the file contents without ever reading the file if the OS supports it.

Pretty much all modern Linux web servers support sendfile(). Examples:

* nginx: [1]
* Haskell webserver module: [2]
* caddy: [3]

[1]: https://nginx.org/en/docs/http/ngx_http_core_module.html#sen...

[2]: https://hackage.haskell.org/package/warp-3.3.28/docs/Network...

[3]: https://github.com/caddyserver/caddy/pull/5022

I'd expect most serious web servers to support it. I've written one that does (workerd); it's not too hard.

That said, it's tricky to use if the server also does TLS termination... then you need kTLS, which is a much bigger can of worms.

Sendfile isn’t kernel bypass.