Hacker News
by pengaru 1461 days ago
PSA: systemd-journald uses shared file-backed mappings via mmap() for its journal IO.

You must subtract its shared memory use from its resident memory use before judging how much memory it's consuming. The file-backed shared mappings are reclaimable, because they are file-backed. The kernel will just evict the mapped journal pages at will, since they can always be faulted back in from the filesystem.
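For anyone who wants to check this on their own box, here is a minimal sketch of the subtraction. It assumes a Linux kernel new enough (4.5 or later) to expose `RssAnon`, `RssFile` and `RssShmem` in `/proc/<pid>/status`; the helper names and the sample numbers are made up for illustration, not taken from journald.

```python
from pathlib import Path


def parse_status_kb(status_text):
    """Collect the 'kB' fields of a /proc/<pid>/status dump into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def rss_breakdown(status_text):
    """Split resident memory into the file/shmem part and the rest."""
    f = parse_status_kb(status_text)
    total = f.get("VmRSS", 0)
    # RssFile + RssShmem is roughly what top reports in its SHR column.
    shared = f.get("RssFile", 0) + f.get("RssShmem", 0)
    return {
        "total_kb": total,
        "shared_kb": shared,
        "non_shared_kb": f.get("RssAnon", total - shared),
    }


def rss_breakdown_for_pid(pid):
    """Read the live numbers for a process; needs read access to /proc."""
    return rss_breakdown(Path(f"/proc/{pid}/status").read_text())


# Invented sample: a process that looks like 50 MB resident,
# of which only 2 MB is anonymous memory.
sample = (
    "Name:\tsystemd-journal\n"
    "VmRSS:\t   50000 kB\n"
    "RssAnon:\t    2000 kB\n"
    "RssFile:\t   47000 kB\n"
    "RssShmem:\t    1000 kB\n"
    "Threads:\t1\n"
)
print(rss_breakdown(sample))
```

On a real system, something like `rss_breakdown_for_pid(pid_of_journald)` would give the live figures; the `non_shared_kb` value is the one worth worrying about.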

TFA is much ado about nothing. Learn to measure memory use properly before breaking out the pitchforks.

Full disclosure: I've hacked a bunch on journald upstream.

5 comments

This is true, but the author seems to be running their services on several Raspberry Pi-like devices whose flash storage may be unstable or quick to wear out. If the goal is to eliminate unnecessary writes and swap use (depending on the application), those megabytes of extra memory may be just enough to tip the system into committing memory to swap.

You can run quite a lot in 512MB of RAM if you use the right languages to write code in. I was surprised by how little RAM my moderately complex daemon written in Rust uses, for example; I expected to have to allocate a gigabyte of RAM to the VM running it (based on what other tools similar to what I was doing needed), but the entire system turned out to be quite comfortable with just a quarter of that. I didn't even try to optimise for memory usage, which is what made this so surprising. I still had to give it some more RAM because unattended upgrades tended to get stuck, but I learned a lesson that day.

Ever since, I've been meaning to experiment with Firecracker plus bare-bones daemons to run virtual machine services with absolutely minimal overhead. From a security standpoint I like virtualisation boundaries much more than container boundaries, and now I wonder how much I can shrink my overhead.

> This is true, but the author seems to be running their services on several Raspberry Pi-like devices whose flash storage may be unstable or quick to wear out. If the goal is to eliminate unnecessary writes and swap use (depending on the application), those megabytes of extra memory may be just enough to tip the system into committing memory to swap.

Well the author seems to want text logs instead, which seems much much worse for this.

If you're concerned about storage wear you'd just run journald without /var/log/journal so it's volatile (tmpfs) only. At least that way you still have journals for your current boot and functionality like `systemctl status $service` can still tell you some journal information.
Yeah this is a lot of work to avoid reading journald.conf, switching the storage to volatile, and capping the memory usage to whatever you want.
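For reference, a minimal sketch of that configuration, assuming a drop-in file under `/etc/systemd/journald.conf.d/`. The file name and the 16M cap are illustrative choices, not recommendations; `Storage=` and `RuntimeMaxUse=` are the journald.conf options involved.

```ini
# /etc/systemd/journald.conf.d/volatile.conf  (file name is arbitrary)
[Journal]
# Keep the journal in memory only (under /run/log/journal), never on flash.
Storage=volatile
# Cap how much space the in-memory journal may use; pick a value that suits the box.
RuntimeMaxUse=16M
```

Restarting systemd-journald (or rebooting) should be enough for the change to take effect.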
> You can run quite a lot in 512MB of RAM if you use the right languages to write code in.

I recently delivered a production-ready embedded system running Armbian with 512 MB of RAM, and we did indeed disable systemd-journald for our purposes. But even with it enabled, our Lua-based app (science/data analysis on a sensor network) was running in the best environment it has ever had, so I can confirm: 512MB is enough for a lot of things.

512MB is absolute overkill for the application that you built; it is the choice of OS and tooling that resulted in that requirement. Not all that long ago 32 MB served a whole bank, and embedded systems used kilobytes of RAM, not megabytes. We've gotten so used to slapping a full Unix server into stuff that we hardly even think about it any more and just take that kind of power completely for granted. I'm not saying you made any wrong choices; it's just that most of the embedded stuff that I come across would be just as feasible on a fraction of the CPU (and power) budget we typically choose, because, for instance, Lua is such a convenient choice for a platform like that.
Windows 95 ran an entire OS with a decent UI in 8 MB of RAM. One really has to wonder: where is all the RAM going these days? I think the knowledge of doing anything with only 8 MB of RAM has gone away; we don't know how to do it anymore.
Your comment brings the following Monty Python sketch to mind:

“What did the Romans ever do for us?

… All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health”

Replace “Romans” with “Increased RAM usage”.

It's not the knowledge - it's the increased complexity of the entire stack, all the way down to the hardware. A modern Linux kernel image is easily bigger than 8MB, and that needs to be in memory at all times. Why? Because of all the functionality it has these days, to fit all the possible use cases people need. Windows 95 didn't have Swap, didn't support many filesystems, didn't have central logging, didn't have ASLR, let alone support for containers, and many other features I'm forgetting along the way.

Sure, you could strip away a lot of that functionality, even at the distribution level (for example by not using an init system at all, just one shell script to initialize things), but then you'd end up with an operating system that's no longer general purpose by today's standards.

Don't forget how much higher screen resolutions are these days. Color depth also. Those 8 MiB systems were driving single-buffered displays with perhaps 800x600 resolution at eight bits per pixel, with a color palette and dithering, which requires about 480 KB to hold the framebuffer image. Most applications would render directly into the framebuffer. A full HD (1080p) screen at 32 bits per pixel requires 8 MiB just to store the framebuffer (16 MiB with double-buffering), and that's not counting any of the input data or code needed for rendering. Figure on two or three times that to hold separate textures for each window (depending on the window sizes and how much they overlap) so that they can be composited live with desktop effects.
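The arithmetic behind those figures, as a small sketch. It only counts raw pixel storage for the resolutions and depths mentioned above; real display stacks add padding, per-window textures and so on, which this deliberately ignores.

```python
def framebuffer_bytes(width, height, bits_per_pixel, buffers=1):
    """Bytes needed for `buffers` full-screen images at the given depth."""
    return width * height * bits_per_pixel // 8 * buffers


MIB = 1024 * 1024

svga = framebuffer_bytes(800, 600, 8)
full_hd = framebuffer_bytes(1920, 1080, 32)
full_hd_double = framebuffer_bytes(1920, 1080, 32, buffers=2)

print(f"800x600 @ 8 bpp:    {svga / 1000:.0f} KB")            # 480 KB
print(f"1920x1080 @ 32 bpp: {full_hd / MIB:.2f} MiB")         # 7.91 MiB
print(f"double-buffered:    {full_hd_double / MIB:.2f} MiB")  # 15.82 MiB
```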
> Windows 95 didn't have Swap

It did have virtual memory and swap.

Can confirm. I remember it swapping heaps on my 32MB machine.
A huge amount of it is going to graphics. A 4K screen is ~31 MB just for the framebuffer. In comparison, 640x480x16 colors is 150K of memory.

Windows 95 also didn't do things the modern way. It didn't keep an image of every application's windows in RAM. It kept track of what covered up what, and then asked applications to redraw themselves when needed.

Another huge amount is going to features like internationalization. Unicode is a beast that takes a good amount of code to implement, and Arial Unicode is a ~20 MB TTF file.

Modern luxuries like being able to tweet in Japanese are quite expensive.

It's not quite that simple: while clean file-backed pages are cheaper than, say, private dirty pages (which the kernel must preserve as long as anyone references them), they're not free: you're still paying an opportunity cost. That is, the kernel is, at least for a time, keeping each clean file-backed page resident when it could be keeping some other page, perhaps a more useful one, in RAM instead. If systemd-journald is append-mostly, it'd be useful to MADV_FREE (after msync) any pages behind the current write pointer so as to give the kernel a hint that it can get rid of those clean file-backed pages early. I'd actually suggest getting rid of the use of memory mapping entirely, but doing so would likely be a bigger ask.
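A rough sketch of the "sync, then hint" idea, with one substitution stated plainly: on Linux, `MADV_FREE` is documented as applying only to private anonymous pages, so this uses `MADV_DONTNEED` on the shared file mapping instead. It assumes Linux and Python 3.8+ (for `mmap.madvise`); the helper name and the temporary file are illustrative, and this is not how journald itself behaves.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def release_written_pages(m, start, length):
    """msync a page-aligned range, then tell the kernel we are done with it."""
    m.flush(start, length)                        # msync(): push dirty pages to the file
    m.madvise(mmap.MADV_DONTNEED, start, length)  # drop this process's mapping of them


with tempfile.TemporaryFile() as f:
    f.truncate(4 * PAGE)
    with mmap.mmap(f.fileno(), 4 * PAGE, flags=mmap.MAP_SHARED) as m:
        m[0:5] = b"entry"                 # "append" something in the first page
        release_written_pages(m, 0, PAGE)
        data_after = m[0:5]               # faults the page back in from the file

print(data_after)  # b'entry'
```

The data survives because the mapping is file-backed and shared; the hint only affects residency. Whether the page cache copy is evicted any sooner is still up to the kernel.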
> the kernel is, at least for a time, keeping each clean file-backed page resident when it could be keeping some other page

It's almost the same result for the standard page cache when you're reading a file, isn't it?

Did the Linux kernel ever fix the thing where it evicts code pages at the same priority as files mapped read/write?

I haven't checked in the last 4 years or so, but, before that, every time I've worked with a Linux-based storage system that used mmap to write to files, I've ended up rewriting it to use pread/pwrite.

Each time, there was no perceptible CPU hit, but there was a massive page cache / memory pressure win. It turns out that aggressively evicting warm code pages then faulting them back in is bad for system performance, even with a fast SSD.
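To make the comparison concrete, this is roughly what the pread/pwrite style looks like, as a minimal sketch. The function names and the record contents are hypothetical and not taken from any particular storage system; it assumes a Unix-like OS where `os.pread` and `os.pwrite` are available.

```python
import os
import tempfile


def write_record(fd, offset, payload):
    """Write payload at offset with pwrite(); loops in case of short writes."""
    view = memoryview(payload)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written
    return offset  # offset just past the record


def read_record(fd, offset, length):
    """Read exactly `length` bytes at offset with pread()."""
    chunks = []
    while length > 0:
        chunk = os.pread(fd, length, offset)
        if not chunk:
            raise EOFError("file ended before the record did")
        chunks.append(chunk)
        offset += len(chunk)
        length -= len(chunk)
    return b"".join(chunks)


with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    end = write_record(fd, 0, b"hello journal")
    record = read_record(fd, 0, end)

print(end, record)  # 13 b'hello journal'
```

Nothing here is mapped into the process, so the file data lives only in the kernel's page cache and never counts toward the process's resident set.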

There's nothing to "fix" here; in some cases the default just isn't optimal for what you want. It is perfectly reasonable for the kernel to prioritize data pages you touched more recently over code pages by default. It's essentially a big LRU, always has been.

If you don't like that, you can always use mlock(). You can also tune things like writeback sysctls and readahead behavior. But I disagree it's "broken" because it doesn't do what you want by default.

In a post-Spectre/Meltdown world syscalls are a bit more expensive, so you'd be hard-pressed to compete with the journal's mmap windows using pread/pwrite, especially with a warm page cache. That goes double if you went about it naively and tried turning every little object access into its own little island of buffered IO. The objects in the journal are quite small, so you'd likely end up having to implement your own page cache/buffer manager in userspace to coalesce the syscalls.
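A toy illustration of the coalescing part, covering the write side only. The `CoalescingWriter` class, its 4096-byte capacity and the 16-byte objects are all invented for the example; a real buffer manager would also need to handle reads, invalidation and crash consistency.

```python
import os
import tempfile


class CoalescingWriter:
    """Hypothetical append buffer: many small objects, few pwrite() calls."""

    def __init__(self, fd, capacity=4096):
        self.fd = fd
        self.capacity = capacity
        self.file_offset = 0        # where the next flush lands in the file
        self.pending = bytearray()  # objects not yet written
        self.syscalls = 0

    def append(self, obj):
        # Flush first if this object would overflow the buffer.
        if self.pending and len(self.pending) + len(obj) > self.capacity:
            self.flush()
        self.pending += obj

    def flush(self):
        view = memoryview(self.pending)
        while view:
            written = os.pwrite(self.fd, view, self.file_offset)
            self.syscalls += 1
            self.file_offset += written
            view = view[written:]
        self.pending = bytearray()


with tempfile.TemporaryFile() as f:
    w = CoalescingWriter(f.fileno())
    for i in range(1000):
        w.append(i.to_bytes(16, "little"))  # 1000 small 16-byte objects
    w.flush()
    on_disk = os.pread(f.fileno(), 16000, 0)

print(w.file_offset, w.syscalls)
```

The 1000 appends end up as a handful of pwrite() calls instead of 1000, which is the whole point; the cost is that userspace now owns a buffer it has to manage.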

It'd be far more interesting to explore an io_uring based implementation IMNSHO.

I was wondering about this exact thing when the article didn't break down the RAM usage. Thanks!
I was waiting for the author to check the memory usage of rsyslog and become enlightened... but it didn't happen. Reminder: check your assumptions and results after changes. He could have learned that rsyslog uses way more shared memory than journald (>900M on my system) and it doesn't matter.
Memory use has always been a mystery to me and I can easily miss some things. Thanks for pointing that out. Anyway, the right solution for me is tldr, all the rest is shit.
Yeah, the rest just shows that OP likes to do things the way they've always done them. How you can prefer to poke around syslog and ps output to determine the state of a service instead of just doing systemctl status is beyond me, for example.
Because systemctl status is called by the monitoring tool. It's a habit, yes. If monitoring shows the service is down, there's no point in running systemctl manually, and syslog and ps become best friends. And sometimes the date command as well. Pis don't have hardware clocks, and a wrong date may lead to errors that look mysterious.
That (syslog,ps) method was likely formed by habit during the many years before systemctl existed.