Hacker News
by djsumdog 3007 days ago
I remember seeing the PCI userspace option in the Linux kernel menuconfig and wondering why anyone would do that, and then a few years ago at Kiwicon I saw my first use case. A presenter was trying to hack a Cisco router.

Older Cisco routers ran IOS directly on proprietary hardware. At some point, Cisco decided to switch to Intel hardware but didn't port their kernel. They used a Linux kernel and ran IOS as a huge 50MB+ binary. The guy doing the talk got shell access and only found one ethernet device when running ifconfig. The actual switching hardware was being handled in userspace by the large binary.

I'm guessing they probably just wrote some shim layers to connect their PCI drivers up to the userspace PCI Linux API.
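As a rough illustration of what userspace PCI access can look like on Linux: sysfs exposes each device's configuration space as a `config` file under `/sys/bus/pci/devices/<BDF>/`. The sketch below only decodes the standard header fields from those bytes. The helper names and the example device address are made up for illustration, and nothing here is known to be what Cisco actually did.

```python
import struct
from pathlib import Path


def sysfs_config_path(bdf: str) -> Path:
    """Path to a device's config-space file, e.g. bdf='0000:00:1f.6' (hypothetical address)."""
    return Path("/sys/bus/pci/devices") / bdf / "config"


def parse_config_header(raw: bytes) -> dict:
    """Decode the common fields in the first 16 bytes of a PCI configuration header."""
    if len(raw) < 16:
        raise ValueError("need at least 16 bytes of configuration space")
    # Offsets 0x00-0x07: vendor ID, device ID, command, status (little-endian 16-bit).
    vendor, device, command, status = struct.unpack_from("<HHHH", raw, 0)
    return {
        "vendor_id": vendor,
        "device_id": device,
        "command": command,
        "status": status,
        "revision": raw[0x08],
        "subclass": raw[0x0A],
        "base_class": raw[0x0B],
        "header_type": raw[0x0E] & 0x7F,         # layout of the rest of the header
        "multifunction": bool(raw[0x0E] & 0x80),  # top bit flags a multifunction device
    }
```

On a real machine this would be used as `parse_config_header(sysfs_config_path("0000:00:1f.6").read_bytes())`, with the address replaced by one that `lspci -D` reports.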

4 comments

Well, most of the magic of hardware routers also comes in the form of hardware acceleration of the actual data plane - i.e. L3 switching - on the silicon itself. That's what makes them fast and so expensive. That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).

I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).

> That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).

Not absurd at all. Cumulus (which I cofounded) does exactly that. There are >1000 customers, including several of the largest cloud operators in the world.

It works really well in practice, since you can just fall back to the kernel for non-fast-path stuff like ARP. IOS/NXOS implement ARP (and everything else) themselves. We can just use the kernel's implementation.

The idea is essentially to use the lightning fast forwarding ASIC as a hardware accelerator for the networking functionality the kernel already has.
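A toy model of that split, for readers who want something concrete: the kernel's routes are mirrored into a "hardware" table that does longest-prefix matching, and anything the table cannot handle (ARP, or a destination with no route) is punted to the kernel. The class, function, and port names (`HardwareFib`, `forward`, `swp1`) are hypothetical and do not correspond to any real ASIC SDK or to Cumulus's actual implementation.

```python
import ipaddress


class HardwareFib:
    """Toy stand-in for an ASIC forwarding table (not a real SDK)."""

    def __init__(self):
        self.routes = {}  # ip_network -> egress port name

    def sync_from_kernel(self, kernel_routes):
        """Mirror the kernel's routing table (prefix string -> port) into the table."""
        self.routes = {
            ipaddress.ip_network(prefix): port
            for prefix, port in kernel_routes.items()
        }

    def lookup(self, dst):
        """Longest-prefix match; returns a port name, or None on a miss."""
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def forward(fib, packet):
    """Fast path for routed IPv4 packets; everything else goes to the kernel."""
    if packet["type"] != "ipv4":
        return "punt-to-kernel"  # e.g. ARP is left to the kernel's own implementation
    port = fib.lookup(packet["dst"])
    return port if port is not None else "punt-to-kernel"
```

For example, after `fib.sync_from_kernel({"10.0.0.0/8": "swp1", "10.1.0.0/16": "swp2"})`, a packet to 10.1.2.3 takes the more specific /16 route, while an ARP frame is handed to the kernel.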

> I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).

That's basically how switch development works in a nutshell; look at Broadcom's OpenNSL.

Isn't switchdev supposed to provide a way to make an interface to in-silicon forwarding engines?
For years and years, the X server was effectively a userspace device driver. It would map the configuration registers and the framebuffer and do everything outside the kernel. And it worked fine, for the most part.

Once GPUs arrived, the ability to do latency-critical management of the device state became important and the register management moved into the kernel. But for traditional framebuffers the device setup was for the most part done once, and there's no particular need for that to be managed outside userspace.

Also, for a long time there was no need to do any kind of fine-grained synchronisation with the graphics hardware apart from the usual IO wait states, and the whole thing could be accessed as a few memory mappings without any kind of interrupt handling (even to the extent of Sun's proprietary UPA slot not even supporting interrupts in its low-cost graphics-only incarnation).
From what I know, the reason they moved device setup to the kernel was to avoid flickering when the system switches from the boot screen to the login manager.
There were a bunch of reasons.

One important one is that accessing the PCI config space via IO ports 0xCF8/0xCFC is racy with the kernel, since a read or write requires writing the BDF address to 0xCF8, and then reading/writing the data from 0xCFC. If the kernel tries to do this dance while the X server is doing it as well one of them is going to read or write the wrong address.
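To make the race concrete, here is a small simulation. `config_address` builds the value written to 0xCF8 using the standard layout (enable bit, bus, device, function, dword-aligned offset). `FakeConfigPorts` is a made-up in-memory stand-in for the two ports, and the register contents are invented; no real port I/O happens.

```python
def config_address(bus, device, function, offset):
    """Value written to port 0xCF8 to select a config-space dword."""
    return (
        0x80000000                  # enable bit
        | ((bus & 0xFF) << 16)
        | ((device & 0x1F) << 11)
        | ((function & 0x07) << 8)
        | (offset & 0xFC)           # dword-aligned register offset
    )


class FakeConfigPorts:
    """Simulated 0xCF8/0xCFC pair backed by a dict (hypothetical, no real I/O)."""

    def __init__(self, registers):
        self.registers = registers  # address value -> 32-bit contents
        self.address = 0

    def write_cf8(self, value):
        self.address = value        # one shared address latch for every caller

    def read_cfc(self):
        return self.registers.get(self.address, 0xFFFFFFFF)


def racy_read_demo():
    """Interleave two unsynchronised readers and return what each one sees."""
    x_addr = config_address(1, 0, 0, 0x00)       # register the X server wants
    kernel_addr = config_address(0, 31, 0, 0x00)  # register the kernel wants
    ports = FakeConfigPorts({x_addr: 0x0000AAAA, kernel_addr: 0x0000BBBB})

    ports.write_cf8(x_addr)         # X server selects its register...
    ports.write_cf8(kernel_addr)    # ...kernel runs and overwrites the latch
    kernel_value = ports.read_cfc()
    x_value = ports.read_cfc()      # X server resumes and reads the wrong register
    return x_value, kernel_value
```

Both readers end up with the kernel's register contents, because the address latch is shared and nothing serialises the two-step sequence.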

Interestingly, this design required the X server to run under binary translation in VMware's monitor, even though it was userspace code, because it had to elevate its IOPL to be able to read/write the IO ports. CSRSS.EXE in Windows also ran in BT, since it too was driving the graphics card before NT4. After NT4 moved the graphics code into the kernel, no one remembered to take out the IOPL elevation code, so at least until XP (and probably later) CSRSS.EXE ran with elevated privileges that it didn't need.

Csrss.exe is the userspace part of the win32 personality. It controls all win32 processes, which on a Windows system is just about every process.

It is not very useful to limit its privileges.

After the graphics driver moved out of it and into the kernel, it probably no longer needs the ability to turn off interrupts and read and write legacy IO ports.
Yes. Most of the interesting parts of modern graphics drivers are loaded into the X client process these days, anyway. Whoever wants to render something talks directly to a thin kernel interface with the X server out of the way except for some high level management stuff.
It's far more than that in the end, but yes: mode switching is a spot where you need whole-system management of the resources. The XFree86 binary couldn't easily make assumptions about what someone else was doing.
It may be equal parts GPL avoidance. Broadcom switch ASIC PDKs are a kind of hybrid kernel and userland application with no legacy reason, so I assume it is just arbitrarily about working around a restrictive license.
For those who are confused, this is not Apple's IOS, but Cisco's OS that was once (not sure about now) called IOS.
Still called IOS (well, there's also IOS-XE, IOS-XR, NX-OS... but that's a different story). Why would they change the name?
True, because Apple's OS is called iOS.