

	by zyx321 311 days ago
	There have been some theories floating around that the 128 GB version could be the best value for on-premise LLM inference. The RAM is split between CPU and GPU at a user-configurable ratio. So this might be the holy grail of "good enough GPU" and "over 100 GB of VRAM", if the rest of the system can keep up.


yencabulator 311 days ago

> The RAM is split between CPU and GPU at a user-configurable ratio.

I believe the fixed split thing is a historical remnant. These days, the OS can allocate memory for the GPU to use on the fly.


geerlingguy 311 days ago

Indeed it can be reallocated, though it needs a reboot. I've gotten up to around 110 GB before running into OOM issues. I set it at 108 GB to provide a little headroom: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...


yencabulator 311 days ago

Also, from your link:

> It seems like tools will have to adapt to dynamic VRAM allocation, as none of the monitoring tools I've tested assume VRAM can be increased on the fly.

amdgpu_top shows VRAM (the old fixed thing) and GTT (dynamic) separately.
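
If you want the same two numbers without a monitoring tool, the amdgpu driver exposes them in sysfs as byte counters. A rough sketch, assuming the GPU is card0 (it may be card1 or another index on your machine) and that the mem_info_* files are present for your kernel version; the function name is made up for illustration:

```python
from pathlib import Path

# sysfs attribute names exposed by the amdgpu driver; values are in bytes.
FIELDS = (
    "mem_info_vram_total",
    "mem_info_vram_used",
    "mem_info_gtt_total",
    "mem_info_gtt_used",
)

def read_gpu_memory(device_dir="/sys/class/drm/card0/device"):
    """Return {field: GiB} for each counter found under device_dir.

    The card0 default is an assumption; pass another directory if needed.
    Missing files are skipped rather than treated as errors.
    """
    out = {}
    for name in FIELDS:
        path = Path(device_dir) / name
        if path.is_file():
            out[name] = int(path.read_text().strip()) / 2**30
    return out

if __name__ == "__main__":
    for name, gib in read_gpu_memory().items():
        print(f"{name}: {gib:.1f} GiB")
```

The vram fields are the fixed firmware carve-out and the gtt fields are the dynamic pool, so a tool that only reads the first pair will under-report what the GPU can use.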


geerlingguy 310 days ago

Good to know!


yencabulator 311 days ago

No need for a reboot: echo 9999 >/sys/module/ttm/parameters/pages_limit

You're talking about an allocator policy for when to allow GTT and when not, not the old firmware-level VRAM split, where whatever size the BIOS sets for VRAM is permanently taken away from the CPU. The max GTT limit is there to reduce accidental footguns; it's not a technological limitation. At least earlier, the default policy was to reserve 1/4 of RAM for non-GPU use, and 1/4 * 128 GB = 32 GB is more than enough, so you're looking to adjust the policy. It's just an if statement in the kernel; GTT, the mechanism, doesn't limit it. Deallocating a chunk of memory used by the GPU returns it to the general kernel memory pool, where it can then be used by the CPU again.
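
To pick an actual number for that knob: as far as I know, pages_limit is counted in pages rather than bytes, so a target size has to be converted first. A minimal sketch of the arithmetic, assuming the system page size (4 KiB on typical x86-64) and taking the 1/4 reservation as the default policy; both function names are made up for illustration:

```python
import os

def pages_for_gib(gib, page_size=None):
    """Whole pages needed to cover `gib` GiB of memory."""
    if page_size is None:
        # Ask the running system; 4096 on typical x86-64 Linux.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return int(gib * 2**30) // page_size

def reserved_for_cpu_gib(total_gib, reserved_fraction=0.25):
    """RAM kept for non-GPU use under the 1/4 default policy."""
    return total_gib * reserved_fraction

if __name__ == "__main__":
    # Candidate pages_limit value for a 108 GiB ceiling.
    print(pages_for_gib(108, page_size=4096))
    # GiB held back from the GPU by default on a 128 GiB machine.
    print(reserved_for_cpu_gib(128))
```

With 4 KiB pages, 108 GiB comes out to 28311552 pages, and the default reservation on a 128 GB machine is 32 GB. Check the value against your own kernel before writing it anywhere.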


zyx321 311 days ago

It's not a fixed split. I don't know if it's possible live, or if it requires a reboot, but it's not hardwired.

I want to know if this is possible: 4 GB for Linux, a bit of room for the calculations, and then a 122 GB model loaded entirely into VRAM.

How would that perform in real life? Someone please benchmark it!
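
For what it's worth, the budget above works out like this. It's only a back-of-the-envelope sketch: the 110 GB figure is the OOM ceiling reported elsewhere in this thread, not a measured limit for every setup, and the function names are made up for illustration:

```python
def headroom_gb(total_gb, os_gb, model_gb):
    """GB left for "the calculations" after the OS and the model weights."""
    return total_gb - os_gb - model_gb

def fits(model_gb, gpu_limit_gb):
    """True if the model weights alone stay under the GPU-usable limit."""
    return model_gb <= gpu_limit_gb

if __name__ == "__main__":
    # 128 GB machine, 4 GB for Linux, 122 GB model.
    print(headroom_gb(128, 4, 122))
    # Same model against the roughly 110 GB ceiling reported in this thread.
    print(fits(122, 110))
```

So the plan leaves 2 GB of slack, and a 122 GB model would not fit under a 110 GB ceiling if that ceiling holds.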


yencabulator 311 days ago

You're still thinking of the old school thing, where you set the split in the firmware and it's fixed for that boot. There's dynamic allocation on top of it these days.

I have that split set at the minimum 2 GB and I'm giving the GPU a 20 GB model to process.
