by PhilipRoman 364 days ago
Local privesc, don't care. If anyone still thinks that they can draw a security boundary anywhere with a shared kernel, they should really look at kernel CVE database (and be horrified). For every fancy titled exploit there are twenty that you've never heard of.

You can sort of do it if you carefully structure your program to restrict syscall use and then use some minimal and well audited syscall filtering layer to hide most of the kernel. But you really have to know what you're doing and proper security hardening will break a lot of software. To get a basic level of security, you have to disable anything with the letters "BPF", hide all virtual filesystems like /proc, /sys, disable io_uring and remove every CONFIG_* you see until something stops working. Some subsystems seem more vulnerable than others (ironically netfilter seems to be a steady source of vulnerabilities).
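For anyone who wants to see where their own box stands on a couple of those items, here is a minimal Python sketch. It assumes a reasonably recent mainline kernel: `kernel.unprivileged_bpf_disabled` and `kernel.io_uring_disabled` are real sysctls (the latter only on newer kernels), but the helper names and the `proc_root` parameter are made up for illustration. It only reads runtime toggles and says nothing about what was compiled out via CONFIG_*.

```python
from pathlib import Path

# Runtime knobs for two of the subsystems mentioned above.
# Paths are relative to the procfs mount point.
KNOBS = {
    "unprivileged_bpf_disabled": "sys/kernel/unprivileged_bpf_disabled",
    "io_uring_disabled": "sys/kernel/io_uring_disabled",
}


def read_knobs(proc_root="/proc"):
    """Return {name: int or None}; None means the file is missing or unreadable."""
    result = {}
    for name, rel in KNOBS.items():
        try:
            result[name] = int((Path(proc_root) / rel).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def not_locked_down(knobs):
    """Names whose value is 0 (no restriction set) or unknown (None)."""
    return sorted(name for name, value in knobs.items() if not value)
```

Calling `not_locked_down(read_knobs())` lists the knobs that are either set to 0 or not present on the running kernel. A nonzero value only means some restriction is in place, not that the subsystem is gone.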

9 comments

> they should really look at kernel CVE database

When citing kernel CVEs as evidence of insecurity, especially so seemingly authoritatively, please make sure you're informed about what Linux kernel CVEs actually mean.

A CVE (for any product) does not automatically mean that there is actually a vulnerability, or that one is exploitable, unless that is explicitly noted (in the CVE or credibly by someone else). Proofs of concept, reproducibility, or any other kind of verification are not part of the CVE process.

For the Linux kernel in particular, the CVE process is explicitly meant to be "overly cautious" [1]. In practice, this means the Linux security team requests a CVE for anything that has a mere whiff of being theoretically exploitable. Of course, that doesn't mean the bug that was fixed was actually exploitable: possibly not even in theory, and certainly not necessarily in practice.

As a result, you can't use CVEs reported by the Linux kernel to make claims about the (lack of) practical security of any Linux system, including your desktop. The CVEs reported by the Linux kernel are there to prompt very well informed users of the kernel to do further risk assessments, not to be taken at face value as a sign of insecurity. [The latter is true for the entire CVE system - they're not to be taken at face value as signs something is wrong. But it's especially true for the kernel.]

[1]: https://docs.kernel.org/process/cve.html#process

This is a common complaint with the whole CVE process to begin with, and isn't even a Linux thing.
You're right. I review each one carefully, so here I mean only the real ones. It's still a massive amount of vulnerabilities, even after excluding obscure drivers or features that aren't used on headless systems.
After the Linux Foundation became a CNA (CVE Numbering Authority), it started issuing CVEs for a broad range of "vulns", such as local denial-of-service, memory errors with no viable exploit path, and logic flaws lacking meaningful security implications.

Looking at the raw number of CVEs is not very meaningful

Indeed. They issue a CVE for every bugfix, because it's long been the position of the Linux maintainers that there's no meaningful distinction between a security bug and a regular bug.
And I'm not sure I can fault them for that, tbh. When you're a kernel, it's very hard to prove that something is a "non-security" bug -- especially when we count DoS as a security bug.
> memory errors with no viable exploit path

i dont appreciate putting "vulns" in scare quotes, if that was your intent

swiss cheese theory. all it takes is someone changing a component that allows that vulnerability to be chained into an exploit, which has happened many times.

these should be tracked, and in fact, it's very helpful to assign cves to them

but yeah, raw numbers are less useful. in fact, cves as an "is it secure or not" metric are pretty rough. it makes it easier to convince vendors to keep their software up to date, though...

Additionally, having simpler vulns labelled allows more juniors to work on coding fixes for them and get their feet wet in that particular subfield.
The way I deal with this at work is: we both work for a person who can fire us for looking at them funny. The threat of dismissal is sufficient for us to expect our peers to be rude neighbors but not criminal ones. If the divisions get big enough that this gets blurry, well then it’s simple enough to ask for private VMs/separate Kube clusters. The Conway’s Law aspects of server maintenance cycles when you report to separate directors/VPs are self-evident.

And of course colocating different classes of work can lead to a bug in a low priority task taking down a high priority one. So those also shouldn’t run in the same partition. Once you’ve taken both of those into account, you’ve already added some security in depth. It’s hard even to escalate a remote exploit into a privilege escalation into attacking a more lucrative neighbor.

> anyone still thinks that they can draw a security boundary anywhere with a shared kernel

Containers are everywhere.

They don't work as reliable security boundaries; they're developer/ops tools.
Thomas, what are your thoughts on micro-VMs such as Kata Containers? You can use them as a backend for Docker in place of runc.

I'm sure you're well aware, but for the readers, they are isolated with a CPU's VT instructions, which are built to isolate VMs. I still think "containers don't contain" in a very Dan Walsh Boston accent, but this seems like a respectable start.

https://katacontainers.io

I have no strong opinion other than that untrusting cotenants shouldn't directly share a kernel.
They're slow and so unsuitable for dev work. They might be somewhat better for prod, but it depends on a wide selection of unproven hypervisors.
Which "unproven" hypervisors are those? Kata works with Firecracker.
QEMU is better known and more thoroughly tested than Firecracker; e.g., a modified version has been used in Xen, which was everywhere over the past decade, while Firecracker is primarily an Amazon-only thing. Cloud Hypervisor, Dragonball, and StratoVirt aren't well-known or battle-tested IMO. The problem is that none of these have the manageability and isolation features of a solid type 1 hypervisor, which makes Kata closer to a user-space application than a reliable platform with hard resource isolation guarantees.

https://github.com/kata-containers/kata-containers/blob/main...

I think they mean with regard to cross-kernel attacks. VMs didn't protect against speculative execution attacks.

I believe there are even more coarse-grained timing attacks with DMA and memory that are waiting to be abused.

Yet people use container based isolation all the time in practice and the sky doesn't fall.

Also, every security domain in an Android system shares a kernel, yet Android is one of the most secure systems out there. Sure, it uses tons of SELinux, but so what? It still has a shared kernel, and a quite featureful one at that.

I don't buy the idea that we can't do intra-kernel security isolation and so we shouldn't care about local privilege escalation.

Android delegated some security features to a different kernel called Trusty that is separated from the main Linux kernel using virtualisation. That kernel runs high value security services.

https://source.android.com/docs/security/features/trusty

Yes, but that's not the main load-bearing security part of the system. Trusty doesn't isolate apps from each other. It doesn't isolate work profiles from user profiles. Regular SELinux-augmented thoughtfully-used uid- and process-isolation does that.
If you weren't aware, containers aren't a security boundary. Things like bubblewrap are.
Semantics make hard assertions about "containers" worthless. It depends on what one means by a container exactly, since Linux has no such concept and our ecosystem doesn't have a strict definition.
What do you think bubblewrap is, if not a container runtime?
bubblewrap is actually worse - there are known escapes in there that haven't been fixed for years
It is the most widely used sandbox layer for pretty much everything. What escapes are you talking about? Are we supposed to take your word for it? Come on
Wait. What? What escapes? Is it that bubblewrap does not faithfully implement the policy you give it, or that there are surprising gaps in the kernel's namespace isolation?
Ironically Ubuntu 24 now blocks users from accessing namespaces because that kernel interface had a bunch of local privilege escalations, breaking programs that want to use them for isolation.
For the last 10 years or so, namespaces in Linux have been the source of the absolute highest number of local privilege escalations and sometimes even arbitrary code execution in kernel space. Building a kernel without user namespace support has been go-to advice for multiuser systems for almost as long. Ubuntu is just late to the game because they mostly have server or single-user-desktop customers.
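If you want to know whether your distro has already clamped down on unprivileged user namespaces, a rough Python sketch like this can help. Assumptions: `user.max_user_namespaces` is a mainline sysctl, while `kernel.unprivileged_userns_clone` and `kernel.apparmor_restrict_unprivileged_userns` are distro-specific (Debian/Ubuntu) and may simply not exist on your system; the function names and `proc_root` parameter are hypothetical. An empty result only means none of these particular knobs is set, not that user namespaces are safe or usable.

```python
from pathlib import Path


def _read_int(path):
    """Read an integer sysctl file; return None if it is missing or malformed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def userns_restrictions(proc_root="/proc"):
    """List the sysctl settings that restrict unprivileged user namespaces."""
    sysctl = Path(proc_root) / "sys"
    reasons = []
    # Mainline: a limit of 0 means no user namespaces can be created at all.
    if _read_int(sysctl / "user/max_user_namespaces") == 0:
        reasons.append("user.max_user_namespaces=0")
    # Distro patch (assumption: present on Debian/Ubuntu kernels only).
    if _read_int(sysctl / "kernel/unprivileged_userns_clone") == 0:
        reasons.append("kernel.unprivileged_userns_clone=0")
    # Ubuntu's AppArmor-based restriction (assumption: Ubuntu kernels only).
    if _read_int(sysctl / "kernel/apparmor_restrict_unprivileged_userns") == 1:
        reasons.append("kernel.apparmor_restrict_unprivileged_userns=1")
    return reasons
```

`userns_restrictions()` returns a list of the settings it found, so sandboxing tools that rely on unprivileged namespaces can at least report why they might fail.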
Actually I think device drivers have got you beat there, but no one's suggesting we break them for users' safety. Ubuntu today is more user hostile than Windows.
Device drivers are worse if you just count the numbers. But they are usually far less exploitable because very often you need to have the corresponding hardware plugged in or even need to manipulate said hardware to provide crafted inputs. So in reality, device driver problems are almost never exploitable.
Seems ironic considering namespaces are highly utilized for isolation/security purposes.
I presume they're left enabled for root.
The same software that wants to use namespaces for isolation will refuse to run as root.
I've even seen namespaces used for hiding malicious software in Ubuntu systems too.
Wouldn't Android's kernel have most of the hardening steps / disabled features described in GP's comment?
No. Things like eBPF, strace, and packet filtering are enabled. Android uses SELinux and other facilities to limit the amount of code the kernel will allow to access these features. Big difference from their being compiled out of the kernel entirely as the OP suggests is necessary.
Container isolation can fail at shared libraries in shared layers too can't it? My evil service is based on the same cooltechframework base layer as your safety critical hardware control service and if there is a mistake in the framework...
then it affects each one separately since they are separate processes. The fact they run the same code is irrelevant if the data is separate.
Separate processes running the same shared instructions. If you compromise and modify those shared instructions, the other container runs instructions of your choosing.
Worse, cannot disable eBPF due to too many packages demanding it.

Namely, nftables and its filtering.

I use my laptop logged in as root, so that's not an issue!
The best is they absolutely can install drivers without your permission unless your system is encrypted. So it's even worse!
Not really relevant, the threat being discussed is for multi-user systems.
And your pulse audio service is running as which user now? This is a local exploit but for any system supporting the mentioned combination of services, aka a lot of them, including the RHEL derivatives and likely Ubuntu.

https://almalinux.org/blog/2025-06-18-test-patches-for-cve-2...

> And your pulse audio service is running as which user now?

I'm not sure, I appear to be running pipewire. But assuming it's not my own account: not a user that will initiate an attack. A user account that allows logins or runs external servers would have to get compromised first, and at that point it can use the exploit directly with no need to touch pulseaudio.

If there's only one directory in your /home, it's very unlikely the urge for admins to patch this is directed at you.

Pipewire runs under the pipewire user, managed by systemd or OpenRC. Which means any of their managed processes can start a new pipewire user process.

A local priv-sec is one exploit [0] away from a remote one.

[0] https://www.bleepingcomputer.com/news/security/hackers-explo...

> Pipewire runs under the pipewire user, managed by systemd or OpenRC. Which means any of their managed processes can start a new pipewire user process.

The box I checked has no pipewire user and it's running under the account I logged in with.
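For anyone who wants to check the same thing on their own machine, here is a small Python sketch that scans procfs for named processes and reports who owns them. Assumptions: Linux with procfs mounted, the owner of `/proc/<pid>` is taken as the process owner (which can show root for processes marked non-dumpable), and the `owners_of` helper and its `proc_root` parameter are made up for illustration.

```python
import pwd
from pathlib import Path


def owners_of(names, proc_root="/proc"):
    """Map process name -> set of user names running it, by scanning procfs."""
    found = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are per-process directories
        try:
            comm = (entry / "comm").read_text().strip()
            uid = entry.stat().st_uid
        except OSError:
            continue  # process exited (or is unreadable) while we were looking
        if comm in names:
            try:
                user = pwd.getpwuid(uid).pw_name
            except KeyError:
                user = str(uid)  # no passwd entry: fall back to the numeric uid
            found.setdefault(comm, set()).add(user)
    return found
```

Something like `owners_of({"pipewire", "pulseaudio"})` returns a dict of whichever of those are running and under which accounts; an empty dict means neither was found.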

> A local priv-sec is one exploit [0] away from a remote one.

That only matters for accounts that talk to the outside world.

If I'm the only user, I'm not depending on security features to keep my account and the pipewire account safe from each other. Privilege escalation is a big threat for systems that are running in a significantly different way.

That's a cheap cop out.
You've worked in security theatre before then ;)
Or you could just use NetBSD like SDF does.
Given this, why is every Linux device not rooted then?
Because GP is talking about theoretical vectors of attack in highly secure environments, whereas you are now discussing why hackers don’t target devices with zero financial gain.

Also just because syscall A might be vulnerable to a particular type of attack, it doesn’t mean that service B uses that syscall, let alone calls it in a way that can be exploited.

I think a majority of systems security people, if asked, would say they assume an attacker with code execution on a Linux system can raise privileges.
I think in the land of people with ill intent to exploit such things they have more potential targets and security vulnerabilities than they can spend time exploiting. A given vulnerability may be terrible, but it might not coincide with something worth bothering with for a given person with ill intent. There's a factor of human choice / payoff at play.