Hacker News
by xvector 828 days ago
> Pixels shipped a massive hardware security feature (MTE) they aren't enabling for the OS to save 3.125% memory/cache usage. It's silly. Heap MTE has near 0% perf overhead in async mode and is cheaper than increasingly ineffective legacy mitigations like SSP in asymmetric mode.

I really want to see someone from the Pixel team justifying the decision here. I really wonder what the thought process is for someone to disable such a significant security feature for negligible performance gain.

4 comments

AOSP engineer here; I don't have a deep knowledge of the issue at hand as I don't work on the bluetooth part of the OS, but I did just want to call out that GrapheneOS (while a really cool project!) supports ~11 total device types, and changes in AOSP have to keep a much larger context in mind.

I'm always looking for ways to implement cool stuff and make things in android better, but it's a little myopic to ignore the larger OEM ecosystem when complaining about specific feature roll outs.

I highly doubt the reason this isn't enabled is the 3% memory/cache usage, and there's some other consideration that's informing the decision.

> AOSP engineer here; I don't have a deep knowledge of the issue at hand as I don't work on the bluetooth part of the OS, but I did just want to call out that GrapheneOS (while a really cool project!) supports ~11 total device types, and changes in AOSP have to keep a much larger context in mind.

AOSP also only directly supports the phones/tablets we support. We were only talking about 2 of the devices supported by AOSP. Pixels enabling MTE is not connected to other devices. App compatibility is also not an issue because it can be enabled for the base OS with user installed apps excluded until they opt into it. It can be made opt-out in a future target API level (Android 15) and then the opt-out can be removed in the next target API level. This is entirely doable.

> I'm always looking for ways to implement cool stuff and make things in android better, but it's a little myopic to ignore the larger OEM ecosystem when complaining about specific feature roll outs.

We're not ignoring it. We're talking about the Pixel 8, Pixel 8 Pro and future Pixels. It's a request for Pixels to enable it, not AOSP to enable it for all devices.

> I highly doubt the reason this isn't enabled is the 3% memory/cache usage, and there's some other consideration that's informing the decision.

Performance and memory usage are definitely the only blocker to enabling it for the base OS. It would already be enabled if the security team was in charge of these decisions. It's not enabled because they likely have to prove it's fast enough. Our specific recommendation was enabling it in asynchronous mode for all cores for the base OS and user installed apps explicitly opting into it. We do not think it's a good idea to start automatically opting apps into using it until more Android devices support it. Google could require that ARMv9 devices support MTE as a developer option for Android 15 onwards in the CDD in order to make sure developers have it available for testing. App developers can't be expected to have a Pixel so it's unrealistic to have it forced on app developers via target API level until more devs have devices available to test it. We understand all that and what we are pushing for takes that into account.

The tradeoff isn't just memory use or performance -- it's also user-facing crashes that weren't present before. That is likely the bigger factor in deciding whether or not to enable the feature.
> The tradeoff isn't just memory use or performance -- it's also user-facing crashes that weren't present before. That is likely the bigger factor in deciding whether or not to enable the feature.

We're only proposing enabling it for the base OS and user installed apps opting into it. Google has already fixed nearly all the crashes due to testing with HWAsan and MTE. They don't test enough with real world usage yet because they haven't deployed it for all the dogfooding devices. To do that, they need to set up enabling it for the base OS without enabling it for all user installed apps because that's not currently very practical. Google has already done most of the work for enabling it for the base OS, not us. We have to fix some bugs, but they're almost all regressions which don't live past the next quarterly releases since they do find and fix them. Google is 100% capable of enabling MTE for the Pixel stock OS within the base OS without a significant increase to crashes for users. In fact, it will significantly decrease user-facing crashes once it matures. It will result in so many memory corruption bugs being fixed. Testing internally with MTE doesn't do the same thing as deploying it to production in terms of bug fixing and also doesn't provide hardening against the bugs not occurring during regular usage.

I don’t understand your argument here. Google has been working on fixing their own crashes with the data they have right now. Why would they turn it on for everyone else while they do that?
They have already fixed nearly all the crashes in the base OS. The issues we face are almost entirely regressions in new versions. They fix them consistently but they aren't stopping the regressions getting into releases because they don't use MTE in production.
Regressions from whom?
Regressions in Android because they aren't doing enough real world testing with MTE or HWASan builds. They're clearly testing them via CI and fixing those issues but in real world usage more issues are uncovered which often slip into releases then get fixed in another release a few months later.
I think their accusation that the decision was made to save 3% memory usage is too presumptive. They also claim that no other OS is shipping with MTE enabled right now. The decision to enable is likely more nuanced.
It's based on communication with them. We've had it directly communicated to us. There are also multiple Google security engineers/researchers who liked/retweeted our posts. Google has stated the Pixel 8 is the first platform with MTE available in production devices, so it's not a large jump to the hardened alternate OS available for it being the first to deploy it in production. We have ~250k users on Pixels, and the userbase on the latest generation with MTE is quickly growing. People on 4th/5th generation Pixels need to move on due to them being end-of-life (other than the Pixel 5a, which will be soon) and we're encouraging moving to the ones with the biggest hardware security improvement since we started. We're not making things up.

They could enable it for the whole base OS and apps opting into it. They have already fixed nearly all the bugs uncovered in regular usage. They did nearly all the work but didn't take it over the finish line due to performance and memory/cache usage concerns. Their security engineers did their job already. They have very talented people working for them. Our ability to ship this feature before them is because the performance and memory concerns are not significant enough to matter to us. We're more than willing to lose 3.125% memory/cache and we accept the performance overhead of asymmetric MTE which is in the ballpark of a few percent overhead in most cases rather than near 0% like asynchronous MTE. There are cases where asymmetric MTE has a larger overhead than a few percent, but it's not common. Async mode is nearly free. MTE may not be as low overhead on future Pixels. It depends on them deciding to prioritize MTE performance in their future custom CPU design. If they do not ship it in production, it's unlikely that they'll prioritize the performance. The overhead may increase from 0% for async and a few percent for asymm to a far more significant cost.
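
For anyone wondering where 3.125% comes from: MTE stores one 4-bit tag for every 16-byte granule of tagged memory. A quick arithmetic sketch in Python (the function names are made up for illustration, and the 8 GiB figure assumes every byte is tagged, which is an upper bound rather than what a real system would do):

```python
TAG_BITS = 4        # MTE allocation tags are 4 bits wide
GRANULE_BYTES = 16  # one tag covers a 16-byte granule

def tag_overhead_fraction(tag_bits=TAG_BITS, granule_bytes=GRANULE_BYTES):
    """Tag storage as a fraction of the memory it covers."""
    return tag_bits / (granule_bytes * 8)

def tag_storage_bytes(tagged_bytes):
    """Bytes of tag storage needed to cover `tagged_bytes` of tagged memory."""
    granules = tagged_bytes // GRANULE_BYTES
    return granules * TAG_BITS // 8

if __name__ == "__main__":
    print(f"{tag_overhead_fraction():.3%}")               # 3.125%
    print(tag_storage_bytes(8 * 1024**3) // 1024**2)      # 256 (MiB for 8 GiB)
```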

The performance argument against MTE being deployed in production and against supporting MTE at all is the argument that's relevant. There is no other significant reason not to ship it for the base OS and enable it for all their own apps in their manifests. Getting it enabled for the whole app ecosystem is a much bigger problem requiring multiple steps: 1) broad availability of MTE capable devices for app developers, 2) making it opt-out instead of opt-in for a future target API level so developers get around a year and a half to either opt out or deal with it, 3) removing the opt-out for a future target API level so that developers cannot simply opt out. We know that part is hard. We know that part involves documentation, developer relations, concerns about giving app developers too much to deal with too quickly, etc. It isn't what we expect them to do short term. What we want them to do is enable the near 0 overhead async MTE for Pixels by default, with it used in the base OS and Google apps opting into it. They already did most of the work, even years earlier via HWAsan testing.
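
The staged rollout described above can be sketched as a small decision function. This is only an illustration of the policy being proposed, not anything that exists in Android: the `Manifest` enum, the `mte_enabled` function and both API level thresholds are hypothetical placeholders.

```python
from enum import Enum

class Manifest(Enum):
    UNSET = "unset"      # app says nothing about MTE
    OPT_IN = "opt-in"    # app explicitly asks for MTE
    OPT_OUT = "opt-out"  # app explicitly declines MTE

# Hypothetical thresholds; the real levels would be Google's decision.
OPT_OUT_API = 35    # from this target API level, MTE is on unless the app opts out
MANDATORY_API = 36  # from this target API level, the opt-out is ignored

def mte_enabled(is_base_os, target_api, manifest):
    """Return True if MTE would be on for this component under the proposal."""
    if is_base_os:
        return True                            # base OS: always on
    if target_api >= MANDATORY_API:
        return True                            # stage 3: opt-out removed
    if target_api >= OPT_OUT_API:
        return manifest is not Manifest.OPT_OUT  # stage 2: on by default
    return manifest is Manifest.OPT_IN         # today: opt-in only

if __name__ == "__main__":
    print(mte_enabled(False, 34, Manifest.UNSET))    # False
    print(mte_enabled(False, 35, Manifest.UNSET))    # True
    print(mte_enabled(False, 36, Manifest.OPT_OUT))  # True
```

Apps that never raise their target API level keep today's behavior, which is why this kind of change does not break existing apps.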

We don't expect them to enable asymmetric MTE or keep track of tags to provide more deterministic guarantees as we're doing. We understand they don't want to sacrifice 5% overall performance, and don't expect them to, but they could provide an opt-in for asymmetric mode + better deterministic guarantees. Google could do it for Android 15 if they make the decision to do it now. A reasonable prediction is that in a couple years Apple ships MTE support in hardware with async mode by default and asymm in lockdown mode, and then Google does the same. They have a chance to be a leader on a hardware security feature far more valuable than the PAC feature where iOS is years ahead.

I feel like you're complaining, but it seems like Google has made historic advancements simply by pushing this technology to the point it's available and fixed most if not all crashes it finds. Stopping short of the goal by not enabling it on prod is likely a well reasoned choice. Google is highly committed to the underlying technology. It doesn't seem like the door to having it enabled in prod is forever closed and perhaps one day it'll happen. You don't have access to all of the information, so you're naturally going to jump to conclusions that might not actually be the best choice.
We have active communication with many people at Google about the areas we're heavily working on such as this and are not basing this on assumptions. You're talking about what you think happened based on your assumptions about it from lightly reading about it.

Most of the crashes were fixed via HWAsan before MTE existed in hardware. Their security people want MTE enabled in production. These issues were fixed before MTE and would have been fixed whether or not MTE was available in the standard ARMv9 cores/cache used by Pixels. MTE would likely already be enabled in production for the base OS (not user installed apps) if there weren't performance concerns. They clearly integrated it with that intention. We're more than capable of reading the commit messages and talking to engineers who worked on it along with other contacts there. They aren't keeping it a secret that there's a clear goal to enable MTE in production.

ARM provided MTE in their standard ARMv9 Cortex core designs. That's why MTE is available on the Pixel 8 and Pixel 8 Pro, because Tensor currently uses the standard core/cache designs. That's why MTE is not available for Snapdragon, and it's why it is theoretically available for MediaTek and Exynos.

The current availability of a high performance MTE implementation doesn't mean Google is highly committed to it. Pixels will be moving from standard Cortex core designs to their own core designs. There's an open question about whether MTE will be supported in the same way it is now. You're assuming that they're heavily committed to it and going to keep providing an extremely low overhead implementation. We have concrete reasons to be concerned.

Thanks for everything you do in security. You're a hero.
TL;DR I would not assume they are not using it, or that this is about 3.125% memory/cache usage.

Longer answer:

Google folks were responsible for pushing on Hardware MTE in the first place - It originally came from the folks who also did work on ASAN, syzkaller, etc. They are not in Android, but it was done with the help and support of folks in Android. That's the Google side, it was obviously a partnership with ARM/etc as well.

I was the director for the teams that created/pushed on it, way back when - this was years ago at this point because of the lead times on a hardware/architecture feature like this.

So i'm very familiar with the tradeoffs.

It is more than just the memory usage or cache. The post is correct that it was designed to be able to be enabled/disabled dynamically, and needed to have expected perf cost ~0, but the main use case at the time was sampling based bug finding.

That is, if you turn MTE on (whether servers, phones, whatever) for 1% of the time for your entire fleet, and you have a large enough fleet, you will find basically all bugs very quickly. You can do this during dogfooding, etc.
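
A rough back-of-the-envelope version of that claim, as a Python sketch. It assumes each device-day is an independent trial and that a given bug has some fixed chance of being triggered per device per day; the function name and all the numbers are made up for illustration.

```python
def detection_probability(fleet_size, sample_rate, trigger_prob_per_device_day, days):
    """Chance that a bug is caught at least once across the fleet.

    sample_rate: fraction of time (or devices) with MTE checking switched on.
    trigger_prob_per_device_day: chance one device hits the bug on a given day.
    Assumes independent device-days, which is a simplification.
    """
    per_device_day = sample_rate * trigger_prob_per_device_day
    return 1.0 - (1.0 - per_device_day) ** (fleet_size * days)

if __name__ == "__main__":
    # A bug hit on roughly 1 in 10,000 device-days, MTE on 1% of the time, one week.
    for fleet in (1_000, 100_000, 1_000_000):
        print(fleet, round(detection_probability(fleet, 0.01, 1e-4, 7), 4))
```

With those made-up numbers a thousand devices almost never catch the bug in a week, while a million devices almost always do, which is the point about needing a large enough fleet.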

Put another way - the goal was to make it possible to have the equivalent of ASAN be flipped on and off when you want it.

Keeping it on all the time as a security mitigation was a secondary possibility, and has issues besides memory overhead.

For example, you will suddenly cause tons of user-visible crashes. But not even consistently. You will crash on phones with MTE, but not without it (which is most of them).

This is probably not the experience you want for a user.

For a developer, you would now have to force everyone to test on MTE enabled phones when there are ~1 of them. This is not likely to make developers happy.

Are there security exploits it will mitigate? Yes, they will crash instead of be exploitable. Are there harmless bugs it will catch? Yes.
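
To make "crash instead of be exploitable" concrete, here is a toy Python model of tag checking. It is a sketch of the idea only, not how the hardware or any real allocator works: the `ToyHeap` class is hypothetical, and retagging on `free` with a tag guaranteed to differ from the old one is an assumption chosen to keep the demo deterministic.

```python
import random

GRANULE = 16  # bytes covered by one tag

class TagFault(Exception):
    """Raised when a pointer's tag does not match the memory's tag."""

class ToyHeap:
    def __init__(self, n_granules, seed=0):
        self.mem_tags = [0] * n_granules
        self.rng = random.Random(seed)

    def _fresh_tag(self, old):
        # Pick a nonzero 4-bit tag that differs from the previous one.
        return self.rng.choice([t for t in range(1, 16) if t != old])

    def malloc(self, granule):
        tag = self._fresh_tag(self.mem_tags[granule])
        self.mem_tags[granule] = tag
        return (tag, granule * GRANULE)  # "pointer" = (tag, address)

    def free(self, ptr):
        _, addr = ptr
        g = addr // GRANULE
        self.mem_tags[g] = self._fresh_tag(self.mem_tags[g])  # retag on free

    def load(self, ptr):
        tag, addr = ptr
        if self.mem_tags[addr // GRANULE] != tag:
            raise TagFault(f"tag mismatch at address {addr:#x}")
        return 0  # toy model: no real data

if __name__ == "__main__":
    heap = ToyHeap(4)
    p = heap.malloc(0)
    heap.load(p)      # fine: tags match
    heap.free(p)
    try:
        heap.load(p)  # use-after-free: stale tag no longer matches
    except TagFault as e:
        print("fault:", e)
```

Without the tag check, the second `load` would silently read freed memory, which is the kind of bug that is either harmless or exploitable depending on what ends up there.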

But keep in mind what i said at the beginning - i would not assume they don't use it.

I would instead assume they don't necessarily keep it on in production all the time.

Anyone who has experience with hardware feature bringup of this kind will tell you that the fact that they can boot and run the system, and that it crashes under MTE only when they do this one thing, is actually a very good sign they do use it.

Otherwise it would have probably crashed a million times :)

As an aside - It's also not obvious it's the best choice for run-time mitigation.

We didn't propose enabling MTE for apps not opting into it any time soon. We proposed enabling it for the base OS by default. Pixels are already testing with HWASan and MTE so there are few issues found by it in the base OS. Enabling it for the base OS and apps opting into it would be a great start. Requiring working MTE support for ARMv9 in the CDD is entirely doable, and then devs will have devices with it, and it can be made into a default for apps at a new target API level with opt-out instead of opt-in. It can then be made into a mandatory feature at a future target API level. Android makes dramatically more aggressive backwards incompatible changes via target API levels than detecting memory corruption without false positives.

We know they're actively testing HWASan and MTE builds, but not with enough real world usage. If they tested it a lot on actual devices used by Google employees, they'd have fixed this Bluetooth LE audio issue before the release.

Thanks for the insight.

>As an aside - It's also not obvious it's the best choice for run-time mitigation.

What are some of the current contenders/arguments?