by pron 8 days ago
The RAM difference is primarily because both Native Image (what you call Graal VM) and Go use much simpler and less efficient memory management techniques. HotSpot uses much more RAM by design as there are inefficiencies caused by using too little of it. Memory management - and especially very sophisticated approaches that are only used by the best-resourced teams - is an especially misunderstood aspect.

I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition. Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM. The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB; the second one just captures it for a second less. This scales to non-extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
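
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the comment's simplifying assumption (a process pinning the CPU at 100% captures all of the machine's RAM while it runs), not anything HotSpot actually computes. The function names and the "MB-seconds" unit are made up for the sketch, and 1GB is rounded to 1000MB.

```python
MACHINE_RAM_MB = 1000  # "1GB of free RAM", rounded to 1000MB (assumption)


def footprint_mb_seconds(resident_mb, seconds):
    """Naive cost: only the RAM the process itself holds, times its runtime."""
    return resident_mb * seconds


def captured_mb_seconds(seconds, machine_ram_mb=MACHINE_RAM_MB):
    """Cost under the comment's assumption: at 100% CPU nobody else can
    use the RAM, so the whole machine's RAM is captured for the runtime."""
    return machine_ram_mb * seconds


# (resident MB, runtime in seconds) for the two options in the example
options = {"80MB for 10s": (80, 10), "800MB for 9s": (800, 9)}

for name, (resident_mb, seconds) in options.items():
    print(name,
          "footprint:", footprint_mb_seconds(resident_mb, seconds),
          "captured:", captured_mb_seconds(seconds))
```

By the naive footprint measure the second option looks 9x worse (7200 vs 800 MB-seconds); by the captured measure it is 10% better (9000 vs 10000 MB-seconds).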

In some situations it may not matter, and I assume that if Native Image and Go work just as well for you, then the workload isn't very high, but under high workloads, this can matter a lot.

3 comments

Nice, I'd like to see the video or audio of your talk. Thanks.
> This is because taking up 100% of the CPU effectively captures 100% of RAM

Isn’t that only true though specifically at 100% CPU utilization?

If it were at 90% CPU, then you have no RAM capture, and then you can’t say anything about whether 80 or 800MB should be taken; it’s only a freebie if and only if literally no other program can do work on the machine.

I don’t see how you can map X% CPU utilization to Y% RAM capture.

Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.

The CPU / RAM capture ratio would be wildly different. I don’t see any way for HotSpot to approximate the ideal for your program while other competing programs of unknown behavior exist.

> Isn’t that only true though specifically at 100% CPU utilization?

No. Because any RAM access requires CPU, using up any CPU effectively captures some ability to use RAM.

> I don’t see how you can map X% CPU utilization to Y% RAM capture.

You're right that there isn't a fixed formula, but the most efficient balance can have a narrow range, because CPU and RAM are typically sold as a package with a rather narrow RAM/core ratio (usually between 0.5 and 4GB per core, where the lower end is usually when you have slow cores). This is also because of the intrinsic relationship of RAM and CPU.

> Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.

A program that is very CPU light can't make use of a lot of physical RAM at any one time (again, because using RAM requires CPU). One exception is caching, but memory access patterns for caching are easily detectable, and you can (and Java does) offer a different balance for them. I covered that in my talk, which will be eventually published on YouTube.

> I covered that in my talk, which will be eventually published on YouTube.

Any idea how I get myself notified once it’s up? Or a YT account to poll?

https://www.youtube.com/java

Don't confuse it with the interview about my talk, which is already up, but doesn't cover any of the important details.

>HotSpot uses much more RAM by design as there are inefficiencies caused by using too little of it.

Ah yes, the swapping induced by IntelliJ overflowing my system RAM is supposed to reduce the inefficiencies of using too little memory. Great...

Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them. One of the reasons I upgraded to 32 GB RAM in 2019 was to run a Minecraft modpack. Minecraft is one of the most memory intensive games I've ever played.

When you consider that the smallest cloud instances that cost $4 per month only give you like 512 MB of RAM and have refused to upgrade for at least a decade, the idea of using more than 512 MB to be "more efficient" is ridiculous. It raises your minimum costs to $10 per month.

>I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition.

>Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM.

This is the "wasted RAM is unused RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping. This will then require you to buy more RAM, leading to more leftover RAM, which is then wasted and gets consumed by the applications again. It's nonsense.

Then there is the fact that the vast majority, basically 99.9% of algorithms are not scalable in the naive way presented. Nobody will waste resources on writing the same algorithm twice for these two cases. Databases are usually designed to either be primarily file system backed or in-memory backed. They will use the extra memory to hold indices and let the OS do the caching or they will reserve all the memory up front, intentionally leaving nothing for other applications.

>The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB; the second one just captures it for a second less.

Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument. It's especially dishonest since even a 99% CPU blockage basically makes your argument fall apart completely.

If two programs each decide to 10x their memory consumption to save one second, you'll most likely run into swapping issues, which will actually lock up your system for several seconds at a time, and if you're unlucky, the OOM killer strikes or the compositor freezes up and you have to reboot. You're saying that a 1 second saving is worth an endless amount of inconvenience.
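
As a rough sketch of the objection in Python, using the numbers from the quoted example: two programs that each take the 800MB option no longer fit in 1GB, while two 80MB programs do. The helper name `overflow_mb` is hypothetical, 1GB is rounded to 1000MB, and the sketch ignores the OS's own memory use and shared pages.

```python
MACHINE_RAM_MB = 1000  # "1GB of free RAM", rounded to 1000MB (assumption)


def overflow_mb(footprints_mb, machine_ram_mb=MACHINE_RAM_MB):
    """How many MB of demand do not fit in physical RAM (0 if all fits).

    Anything above 0 would have to be swapped out or refused.
    """
    return max(0, sum(footprints_mb) - machine_ram_mb)


print(overflow_mb([80, 80]))    # both programs pick the small footprint
print(overflow_mb([800, 800]))  # both programs pick the 10x footprint
```

With two small footprints nothing overflows; with two large ones, 600MB of demand has nowhere to go but swap.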

>This scales to non-extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.

Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.

CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.
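
A minimal sketch of what tiling looks like, written in Python purely to show the loop structure: a blocked matrix multiply that works on `tile` x `tile` sub-blocks at a time. The function name and the tile size are illustrative assumptions, and pure-Python lists will not show the cache benefit that the same loop order gives in a language with flat arrays.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices a (n x m) and b (m x p) block by block.

    Each (i0, k0, j0) step touches only small sub-blocks of a, b and c,
    which is the property that lets a working set fit in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Work inside one tile; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(matmul_tiled(a, a))
```

The result is the same as an untiled multiply; only the order in which memory is visited changes.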

When performing inter-thread communication, there are often situations where the data doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another core's cache.

Nowadays DMA is the standard way to perform large data transfers to hardware peripherals. If you load a file from an HDD, the SATA peripheral will communicate via DMA to copy whole sectors or file system blocks. The same applies to sending data to an SSD, network interface, GPU or basically anything else that performs bulk transfers (1 KiB+). The DMA engine is a separate component independent of the CPU and it may write data directly into cache as well.

Then there is the fact that RAM is a form of storage and storage is usually characterized by the fact that it takes up an area and said areas can be subdivided. When RAM is used, the portion of used RAM is considered blocked for the duration of how long it is stored, independently of whether it is accessed or not. This means that the most important objective is having sufficient amounts of RAM to store all data, not to occupy all of it preemptively even when it is not really needed.

The same can't be said of CPUs. Occupying the CPU usually means actively using the CPU. The only exception to this is things like spinlocks, which should be avoided like the plague. What occupies the CPU is determined by the OS, therefore your logic is backwards. It's not the program blocking the CPU and therefore blocking the memory. The OS decided to stop running your process to run another process. Progress is slowed down, but it is not blocked.

Actual blockage only occurs when two processes compete for a fixed resource so that it is not possible to run both processes simultaneously, so that one process has to be closed to run another process.

> Ah yes, the swapping induced by IntelliJ overflowing my system RAM is supposed to reduce the inefficiencies of using too little memory. Great...

That's like me saying, oh great, so the swapping introduced by MS Word or Outlook shows just how efficient C++ is...

> Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them.

Oh I didn't just "buy" them. As a low-level programmer who's suffered for a long time from intrinsic inefficiencies in C++, I became a compiler and runtime engineer working on the JVM to solve the problems I had in C++.

> This is the "wasted RAM is unused RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping

No, it's actually more involved and interesting than that, but you'll have to wait for my talk.

> Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument

I didn't. I specifically said it was just an example to demonstrate the inter-relatedness of RAM and CPU, since accessing RAM requires CPU. To understand why every single language that isn't limited by other constraints and has the engineering resources to do so uses the same basic memory management algorithm as Java, I guess you'll have to watch my talk when it's published.

> Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.

Wow. I guess it doesn't take much to be an engineer working on safety critical realtime applications and then on one of the world's most advanced optimising compilers, and you can get pretty far without knowing how computers work.

> CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.

The data you need to access at any one time and the overall memory consumption of your program are two very different things. Maybe you don't know this, but CPU caches don't work by caching a large contiguous portion of the address space.
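
One way to picture this is the address-to-set mapping of a set-associative cache, sketched here in Python. The geometry (64-byte lines, 64 sets, which would correspond to a 32KiB 8-way cache) is an assumption chosen for illustration, `cache_set` is a hypothetical helper, and the virtual vs physical address distinction is ignored.

```python
LINE_BYTES = 64  # assumed cache line size
NUM_SETS = 64    # assumed number of sets (e.g. 32KiB / 64B lines / 8 ways)


def cache_set(address):
    """Index of the cache set that the line containing `address` maps to."""
    return (address // LINE_BYTES) % NUM_SETS


# Addresses that are bytes, kilobytes and a gigabyte apart
for address in (0, 64, 4096, (1 << 30) + 64):
    print(hex(address), "-> set", cache_set(address))
```

Lines from addresses a gigabyte apart land in ordinary sets next to lines from nearby addresses, so what the cache holds at any moment is a scattering of individual lines from across the address space rather than one contiguous region.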

> When performing inter-thread communication, there are often situations where the data doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another core's cache.

I find it hilarious that you're trying to teach me about MESI, given that designing algorithms and data structures that are efficient on top of MESI was one of my jobs [1], and I advised Intel on architecture, but okay, maybe I know nothing about computers, as you concluded from a paragraph where I tried to give people who may not be compiler or memory management experts some intuition about modern memory management design.

FYI, modern malloc/free allocators are also intentionally less footprint-optimised than older ones to get better performance (although they can't offer all the optimisations of moving collectors because they're not allowed to move pointers), but maybe none of the people writing the compilers or memory management mechanisms you use know computers as much as you do, and you know all there is to know.

[1]: I later even wrote, for a general audience, about data structures over distributed MESI (well, MOESI to be precise) protocols: https://highscalability.com/the-performance-of-distributed-d...

This looks to be the end of the conversation now. Just wanted to drop in and thank you for your time commenting, pron.

The common discourse is that "XYZ language is close to the metal and therefore Blazing Fast (tm)". People become tribalistic and forget that there are engineering considerations and trade-offs all the way down. I appreciate you making the argument for the JVM delivering performant code when a budget matters.