by adsharma 98 days ago
> We plan to deliver improvements to [..] purging mechanisms

During my time at Facebook, I maintained a bunch of kernel patches to improve jemalloc purging mechanisms. They weren't popular in the kernel or the security community, but they were more efficient on benchmarks for sure.

Many programs run multiple threads, allocate in one and free in the other. Jemalloc's primary mechanism used to be: madvise the page back to the kernel and then have it allocate it in another thread's pool.

One problem: this involves zeroing memory, which hurts cache locality and overall app performance. It's completely unnecessary if the page is being recirculated within the same security domain.
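To make the zeroing concrete, here is a minimal sketch, assuming Linux and Python 3.8+ (where `mmap.madvise` is available). The function name `dontneed_roundtrip` is made up for illustration and this is not jemalloc code. It dirties one private anonymous page, returns it with `MADV_DONTNEED`, and reads it back. The old contents are gone, so the kernel has to supply a zero-filled page on the next touch.

```python
import mmap

PAGE = mmap.PAGESIZE


def dontneed_roundtrip(pattern: int = 0xAA) -> bytes:
    """Dirty one private anonymous page, drop it with MADV_DONTNEED, read it back."""
    # fileno=-1 asks for anonymous memory. MAP_PRIVATE matters here: shared
    # mappings do not get the zero-fill behaviour after MADV_DONTNEED.
    buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = bytes([pattern]) * PAGE   # touch the page so real memory backs it
        buf.madvise(mmap.MADV_DONTNEED)    # hand the page back to the kernel
        return bytes(buf[:])               # this read faults in a fresh page
    finally:
        buf.close()
```

On Linux the returned bytes should all be zero, whatever `pattern` was written. That zero-fill is the cost being discussed here, paid even when the page goes straight back to the same process.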

The problem was getting everyone to agree on what that security domain is, even if the mechanism was opt-in.

https://marc.info/?l=linux-kernel&m=132691299630179&w=2


I'm really surprised to see you still hawking this.

We did extensive benchmarking of HHVM with and without your patches, and they were proven to make no statistically significant difference in high level metrics. So we dropped them out of the kernel, and they never went back in.

I don't doubt for a second you can come up with specific counterexamples and microbenchmarks which show benefit. But you were unable to show an advantage at the system level when challenged on it, and that's what matters.

You probably weren't there when servers were running for many days at a time.

By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over. If you're restarting the server every few hours, of course the memory fragmentation isn't much of an issue.

> But you were unable to show an advantage at the system level when challenged on it, and that's what matters.

You mean 5 years after I stopped working on the kernel and the underlying system had changed?

I don't recall ever talking to you on the matter.

> By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over

Nope, I started in 2014.

> I don't recall ever talking to you on the matter.

I recall. You refused to believe the benchmark results and made me repeat the test, then stopped replying after I did :)

The patches were written in 2011 and published in 2012. They did what they were supposed to at the time.

For the peanut gallery: this is a manifestation of an internal eng culture at fb that I wasn't particularly fond of. Celebrating that "I killed X" and partying about it.

You didn't reply to the main point: did you benchmark a server that was running several days at a time? Reasonable people can disagree about whether this is a good deployment strategy or not. I tend to believe that there are many places which want to deploy servers and run them for days, if not months.

For the peanut gallery more: I worked with both of these guys at Meta on this.

The "servers are only on for a few hours" thing was like never true so I have no idea where that claim is coming from. The web performance test took more than a few hours to run alone and we had way more aggressive soaks for other workloads.

My recollection was that "write zeroes" just became a cheaper operation between '12 and '14.

A fun fact to distract from the awkwardness: a lot of the kernel work done in the early days was exceedingly scrappy. The port mapping stuff for memcached UDP before SO_REUSEPORT for example. FB binaries couldn't even run on vanilla linux a lot of the time. Over the next several years we put a TON of effort in getting as close to mainline as possible and now Meta is one of the biggest drivers of Linux development.

It's not just that zeroing got cheaper, but also we're doing a lot less of it, because jemalloc got much better.

If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.
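A toy model of the point about caching, with loud caveats: this is not jemalloc's algorithm (the class, function, and parameter names are invented for illustration, and the real decay logic is more elaborate than a fixed deadline). It only shows why holding freed pages for a while cuts trips to the kernel in an allocate/free loop.

```python
from collections import deque


class DecayingPageCache:
    """Toy model: freed pages stay cached for decay_ms before being purged."""

    def __init__(self, decay_ms: int):
        self.decay_ms = decay_ms
        self.free_pages = deque()  # (freed_at_ms, page_id), oldest first
        self.purged = 0            # pages handed back to the "kernel"
        self.fresh = 0             # pages requested from the "kernel"
        self.next_id = 0

    def purge_expired(self, now_ms: int) -> None:
        # Drop every cached page that has been idle for at least decay_ms.
        while self.free_pages and now_ms - self.free_pages[0][0] >= self.decay_ms:
            self.free_pages.popleft()
            self.purged += 1

    def alloc(self, now_ms: int) -> int:
        self.purge_expired(now_ms)
        if self.free_pages:
            # Reuse the most recently freed page; no kernel involvement.
            return self.free_pages.pop()[1]
        self.fresh += 1
        self.next_id += 1
        return self.next_id

    def free(self, page_id: int, now_ms: int) -> None:
        self.purge_expired(now_ms)
        self.free_pages.append((now_ms, page_id))


def kernel_round_trips(decay_ms: int, cycles: int = 100, gap_ms: int = 5) -> int:
    """Allocate and free one page per cycle; count pages fetched from the kernel."""
    cache = DecayingPageCache(decay_ms)
    now = 0
    for _ in range(cycles):
        page = cache.alloc(now)
        now += 1
        cache.free(page, now)
        now += gap_ms
    return cache.fresh
```

With `decay_ms=0` every cycle goes back to the kernel (100 fresh pages for 100 cycles). With a 10-second window, chosen arbitrarily here, the same loop needs a single fresh page. In real jemalloc the related knobs are, as far as I know, `dirty_decay_ms` and `background_thread`.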

Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".

However, the drawback is that system-level memory accounting becomes even more fuzzy.
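A rough sketch of the lazy-reclaim behaviour, assuming Linux 4.5+ and a Python build that exposes `mmap.MADV_FREE`. The function name and the fallback to `MADV_DONTNEED` are assumptions of this example, not anything jemalloc does. After `MADV_FREE` the page may still hold its old bytes or may read as zeros, depending on whether the kernel reclaimed it. Writing to it again cancels the pending free.

```python
import mmap

PAGE = mmap.PAGESIZE


def lazy_free_roundtrip() -> tuple:
    """Return (contents right after MADV_FREE, contents after reusing the page)."""
    buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xaa" * PAGE
        # MADV_FREE only exists where the platform headers define it; falling
        # back to MADV_DONTNEED is an assumption made for this sketch.
        advice = getattr(mmap, "MADV_FREE", mmap.MADV_DONTNEED)
        try:
            buf.madvise(advice)
        except OSError:
            buf.madvise(mmap.MADV_DONTNEED)
        after_free = bytes(buf[:])   # old data if not yet reclaimed, else zeros
        buf[:] = b"\x55" * PAGE      # a write cancels the pending lazy free
        after_reuse = bytes(buf[:])
        return after_free, after_reuse
    finally:
        buf.close()
```

The first element is deliberately not pinned down: which of the two outcomes you see depends on memory pressure. The same laziness is why accounting gets fuzzy, since the page still looks resident until the kernel actually takes it.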

(hi Alex!)

[ Edit: "servers" in this context meant the HHVM server processes, not the physical server which of course had a longer uptime ]

People got promoted for continuous deployment

https://engineering.fb.com/2017/08/31/web/rapid-release-at-m...

I think it's fair to say the hardware changed, the deployment strategy changed and the patches were no longer relevant, so we stopped applying them.

When I showed up, there were 100+ patches on top of a 2009 kernel tree. I reduced that to about 10 or so critical patches and rebased them on a 6-month cadence over 2-3 years. Upstreamed a few.

Didn't go around saying those old patches were bad ideas and I got rid of them. How you say it matters.

This is why I always read the comments here.
That is, wow, a story.

At what point did you realize how different fb engineering was from what you expected?

This is why I love hacker news. I learn so much from these moments.
Like "never work at Meta unless you can out-toxic your coworkers".
Yeah, I knew Meta was toxic, but publicly beefing over something from over a decade ago is a whole other matter. I can’t even remember what I was working on 10 years ago, and even if I did I wouldn’t be bringing people down that much later.
Inside Meta, engineers are one of the kindest groups of people.

This thread would've been way more fun with a couple of middle managers and product managers in the mix ;-)

Funny, I was thinking what a relief it was to see people making their arguments frankly like on the HN of 10+ years ago.
Like "Hey, I wonder if Conway's Law works both ways. Huh. Wow. It looks like that is indeed the case."
I came here for the article, stayed for the drama.
I wouldn't be surprised if both 'adsharma' and 'jcalvinowens' were right, just at different points in time, perhaps in a bit different context. Things change.
I like your clocks!
Maybe I'm misreading, but considering it OK to leak memory contents across a process boundary because it's within a cgroup sounds wild.
It wasn't any cgroup. If you put two untrusting processes in a memory cgroup, there is a lot that can go wrong.

If you don't like the idea of memory cgroups as a security domain, you could tighten it to be a process. But kernel developers have been opposed to tracking pages on a per address space basis for a long time. On the other hand memory cgroup tracking happens by construction.

> across a process boundary

> within a cgroup

Note the complementary language usage here. You seem to have interpreted that as me writing that it didn't matter what cgroup they are in, which is an odd thing to claim that I implied. I meant within the same cgroup obviously.

Yes, you can read memory out of another process through other means, but you shouldn't be able to map pages, read them, and see what happened in another process. That's the wild part. It strikes me as asking for problems.

I was unaware of MAP_UNINITIALIZED, support for which was disabled by default and for good reason. Seems like it was since removed.

I was clarifying that there are CPU cgroups, network cgroups etc and the proposal touched only memory cgroups.

The people deploying it are free to restrict the cgroup to one process before requesting MAP_UNINITIALIZED if there is a concern around security. At that point the memory cgroup becomes a way to get around the page tracking restriction.

But I get why aesthetically this idea sounds icky to a lot of people.

What metrics were improved by your patches?
Some more historical context. It wasn't a random optimization idea that I thought about in the shower and implemented the next day. Previous work on company wide profiling, where my contribution was low level perf_events plumbing:

https://research.google/pubs/google-wide-profiling-a-continu... https://engineering.fb.com/2025/01/21/production-engineering...

The profiling clearly showed kernel functions doing memzero at the top of the profiles which motivated the change. The performance impact (A/B testing and measuring the throughput) also showed a benefit at the point the change was committed.

This was when "facebook" was a ~1GB ELF binary. https://en.wikipedia.org/wiki/HipHop_for_PHP

The change stopped being impactful sometime after 2013, when a JIT replaced the transpiler. I'm guessing likely before 2016 when continuous deployment came into play. But that was continuously deploying PHP code, not HHVM itself.

By the time the patches were reevaluated I was working on a Graph Database, which sounded a lot more interesting than going back to my old job function and defending a patch that may or may not be relevant.

I'm still working on one. Guilty as charged of carrying ideas in my head for 10+ years and acting on them later. Link in my profile.

This kind of thing always struck me as something that the MMU and the memory controller could team up on. When you give back memory, you could not refresh it for some cycles. Or you could DMA the same page of zeros over all of it, so the CPU isn't involved in menial labor.
This is an old debate that goes back 25+ years. One of the differences in how Linux and FreeBSD handle the issue.

Linux developers believe that involving the CPU warms the caches and is a good thing.