| HN Mirror


by cafkafk 61 days ago

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing this and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention; see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.

9 comments

Sweepi 61 days ago

"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

But ... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3) and vice versa?

I also don't understand the explanation of "--cpu-moe". If an expert has ~4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?
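To put the cache question in numbers, here is a minimal sketch that uses only the two figures quoted in this comment (20 MiB of L3 and ~4.0 GiB of parameters). The function name is made up for illustration, and the result says nothing about what `--cpu-moe` actually does internally.

```python
# Figures taken from the comment above; everything else is illustrative.
L3_CACHE_MIB = 20.0        # L3 cache size quoted in the comment
EXPERT_PARAMS_GIB = 4.0    # parameter footprint quoted in the comment


def cache_fraction(cache_mib: float, working_set_gib: float) -> float:
    """Fraction of a working set that fits in the cache at any one time."""
    return cache_mib / (working_set_gib * 1024.0)


fraction = cache_fraction(L3_CACHE_MIB, EXPERT_PARAMS_GIB)
print(f"L3 can hold {fraction:.2%} of the parameters at once")
```

Under these figures the cache covers well under 1% of the parameters, which is the point the question is making.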

As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.

zamadatix 61 days ago

> But ... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3) and vice versa?

Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
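The "full bus" argument can be sketched as a back-of-the-envelope ceiling: if every generated token has to stream a fixed amount of weights from RAM, throughput is capped by bandwidth divided by bytes per token, and the thread count does not appear in the formula at all. Both inputs below are assumptions for illustration (about 59.7 GB/s for quad-channel DDR3-1866 and about 4.0 GiB read per token); neither was measured on the machine in the article, and the estimate ignores caches and speculative decoding.

```python
GIB = 1024 ** 3


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, gib_read_per_token: float) -> float:
    """Upper bound on tokens/s when decoding is purely memory-bandwidth bound.

    Note that the number of threads is not an input: once the memory bus is
    saturated, extra SMT threads cannot raise this ceiling.
    """
    return bandwidth_gb_s * 1e9 / (gib_read_per_token * GIB)


# Assumed values, not measurements.
ceiling = decode_ceiling_tokens_per_s(bandwidth_gb_s=59.7, gib_read_per_token=4.0)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```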

ethbr1 61 days ago

This is ironically a pretty solid use case for (ex VLIW research) ILP-optimizing compilers.

Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.

Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.

Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Though why would anyone but hobbyists bother, when everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)

sireat 61 days ago

Fantastic practical achievement!

I wonder if I could get similar or even better performance from a similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core wise, but that probably does not make much difference?

It has:

CPUs: 2 × Xeon E5-2697 v2

Cores / threads: 24 cores / 48 threads total

Per-CPU cores: 12 cores / 24 threads

Base clock: 2.70 GHz

Max turbo: 3.50 GHz

It is sitting gathering dust, but reading-speed Gemma sounds promising.

fragmede 61 days ago

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

cafkafk 61 days ago

> (purple on black is really hard to read)

Noted, and agreed (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.
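The per-token figures in that output can be re-derived from the raw milliseconds and counts. This is a small sketch that parses the two relevant lines copied from the log above; the regular expression and the function name are my own and only target this log layout.

```python
import re

# The two throughput lines, copied verbatim from the output above.
LOG = """\
llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
"""

# Captures the phase name, the elapsed milliseconds, and the token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)"
)


def throughput(log: str) -> dict[str, float]:
    """Return tokens per second for each timing phase found in the log."""
    rates = {}
    for name, ms, count in PATTERN.findall(log):
        rates[name] = int(count) / (float(ms) / 1000.0)
    return rates


rates = throughput(LOG)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/s")
```

"prompt eval" is the rate for the 7 input tokens and "eval" is the generation rate, which is the 11.94 tokens per second quoted above.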

When I do it properly, I'll add it to the blog as well!

fhars 61 days ago

And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).

ethbr1 61 days ago

> two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-)

2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...

2026 Open Source ML: Hold my beer.

ekianjo 61 days ago

20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.

A GPU typically processes close to 1000 tokens/s during eval.

hnfong 61 days ago

The prompt is literally "why is the sky blue?" and consists of 7 tokens.

It's probably too small for the timings to be taken seriously.

boutell 61 days ago

I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.

Majromax 61 days ago

From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.

Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.

The maximum parallel width becomes equal to the number of layers in the model. The Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).

In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
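The parallel frontier described above can be enumerated directly. This sketch follows the comment's (token, layer) notation and its dependency rule, and uses the 30-layer figure from the comment; it illustrates the scheduling idea only and is not how any particular inference engine batches prompt processing.

```python
def wavefronts(n_tokens: int, n_layers: int) -> list[list[tuple[int, int]]]:
    """Group (token, layer) cells by the earliest sweep in which they can run.

    A cell (t, l) lands in sweep t + l, so everything it depends on
    (layer l - 1 of tokens 0..t) has finished in an earlier sweep.
    """
    sweeps = []
    for step in range(n_tokens + n_layers - 1):
        cells = [(t, step - t) for t in range(n_tokens) if 0 <= step - t < n_layers]
        sweeps.append(cells)
    return sweeps


def ideal_speedup(n_tokens: int, n_layers: int) -> float:
    """Total cells divided by number of sweeps, ignoring all other limits."""
    return (n_tokens * n_layers) / (n_tokens + n_layers - 1)


fronts = wavefronts(n_tokens=7, n_layers=30)
max_width = max(len(cells) for cells in fronts)
print(fronts[1])   # the second sweep holds (0, 1) and (1, 0)
print(max_width)   # width is capped by the shorter dimension
print(f"{ideal_speedup(7, 30):.2f}x for 7 tokens, {ideal_speedup(1000, 30):.2f}x for 1000 tokens")
```

With only seven tokens the frontier never gets wider than seven cells, so the 30-fold figure is only approached for prompts much longer than the layer count.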

bboozzoo 61 days ago

A seven-token input isn't very realistic, is it? For coding tasks it's normal for the input to be thousands or tens of thousands of tokens. If it wasn't for prefix caching it'd be one miserable experience, but even then at the very best the input is often in the hundreds each time. And don't even try to dump some logs into the prompt.

throwawayffffas 60 days ago

He meant prompt eval time, but have a look at these guys: https://www.youtube.com/watch?v=ndSA9T5yvmM

Over 2500 tokens per second on a single request. With 8 MI300X.

ekianjo 61 days ago

I meant prompt eval time.

bbatha 61 days ago

What's time to first token? Raw throughput is usually not the problem in local setups in my experience.

anon-3988 61 days ago

I am pretty sure llama.cpp has its own benchmarking binary that you can use.

mft_ 61 days ago

llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?

gdjdhdheb 61 days ago

You sure you got DDR3 .. I have 2 e5 v4 rigs at home and both have ddr4 ... Unless I am wrong and 2011-3 supports ddr3 and ddr4

duffyjp 60 days ago

I won't speak for cafkafk, but I have two E5 (v3/v4) systems, one on DDR4 and one on DDR3. This generation of CPUs all support DDR4, but a few SKUs do support DDR3 also. ChatGPT told me they were niche products to meet specific customer needs.

I just picked up the DDR3 board, an Aliexpress "XD3" so I could reuse some DDR3 ram on a better CPU. Quad channel 1866MT/s is not bad!
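For a sense of scale, theoretical peak bandwidth is channels × bytes per transfer × transfer rate. The sketch below assumes the standard 64-bit (8-byte) DDR channel width; real sustained bandwidth is lower than this peak, and the function name is just for illustration.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * bytes_per_transfer * mt_per_s * 1e6 / 1e9


quad = peak_bandwidth_gb_s(channels=4, mt_per_s=1866)
dual = peak_bandwidth_gb_s(channels=2, mt_per_s=1866)
print(f"quad channel DDR3-1866: {quad:.1f} GB/s")
print(f"dual channel DDR3-1866: {dual:.1f} GB/s")
```

Quad channel at 1866 MT/s comes out just under 60 GB/s peak, twice what a dual-channel setup at the same speed can reach.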

lightedman 61 days ago

The first two generations supported DDR3 only. Haswell and Broadwell (v4) brought DDR4 support.

_zoltan_ 61 days ago

right, and they talk about "v4" which is DDR4.

lightedman 60 days ago

There were several V4 Xeon models that supported DDR3 AND DDR4 simultaneously. If you had a motherboard with an X79 chipset it would (sometimes) work properly.

_zoltan_ 60 days ago

I am not aware of any commercial vendor shipping v3/v4 boards with DDR3. I have a couple hundred Supermicro systems that are stuck on v2 CPUs with DDR3...

lightedman 60 days ago

Get a 2696 v4 or 2686 v4 and a X79 motherboard and you should be able to use DDR3.

dawnerd 60 days ago

I have a dual e5 v3 that had ddr 4 as well. Been going strong for ten years and still overpowered for what I use it for.

_hyn3 60 days ago

You're right - the article says 'CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz' but also says DDR3. And the specs page for that CPU (https://www.intel.com/content/www/us/en/products/sku/92986/i...) clearly says the 2620 v4 is DDR4.

E5 CPUs have their supported RAM right on the Intel ARK pages, but short version:

E5-xxxxx v1 and v2 are all DDR3

E5-xxxxx v3 and v4 are all DDR4

Not sure why Intel didn't just cut new model numbers instead of keeping them all as "e5"

More concrete example for E5-2660 (great processor) showing v1 and v2 support DDR3, while v3 and v4, DDR4 (again, different motherboards)

DDR3 v1: https://www.intel.com/content/www/us/en/products/sku/64584/i...

DDR3 v2: https://www.intel.com/content/www/us/en/products/sku/75272/i...

DDR4 v3: https://www.intel.com/content/www/us/en/products/sku/81706/i...

DDR4 v4: https://www.intel.com/content/www/us/en/products/sku/91772/i...

This also means that you need to know the processor your motherboard supports (or, easier, probably RAM) before putting in an order to upgrade the processor. (These processors are incredibly cheap, less than $10 for something that might have cost literally thousands ten years ago, so worthwhile to spend a few minutes and pick out your favorite based on cores, watts, GHz, etc.)

(Another commenter says that there are some motherboards that accept v3/v4 but also can run slower DDR3 RAM. That's new to me and quite cool - DDR3 is extremely cheap, even now. I did find these motherboards on aliexpress, too: https://www.aliexpress.us/w/wholesale-XD3-motherboard.html?s... and one clearly says v3/v4 cpu's with DDR3 RAM. That could be very useful although memory speeds are slower since CPU performance can be boosted with v3/v4.)

v1: https://www.intel.com/content/www/us/en/ark/products/series/...

v2: https://www.intel.com/content/www/us/en/ark/products/series/...

v3: https://www.intel.com/content/www/us/en/ark/products/series/...

v4: https://www.intel.com/content/www/us/en/ark/products/series/...

m463 60 days ago

I bought a renewed Dell T7810 server with 2x E5-2690 v4 (28c/56t) and 128GB on Amazon for under $500 two years ago.

search amazon for "chia farming" ...and scroll past chia seeds :)

now same machine is 2.5x the price

https://www.amazon.com/dp/B095TRGCSX

but way cheaper than current ddr5 machines

justinram11 60 days ago

Bought the exact same machine (same config and ram as well) around the same time off ebay for ~$280. Part of me wonders if I should sell it, but I do occasionally like to play with homelab stuff.

I have a 3060 12gb card I'd love to hook up to my PoE Reolink cameras for face detection and to get off of the Reolink app.

overfeed 60 days ago

> now same machine is 2.5x the price

2.5x?! I have a bunch of older Haswell servers I got for free that are rotting away in my garage. I had initially thought of stripping out the ECC DDR4, but now I'm wondering if I'll get takers on Marketplace...

sixothree 60 days ago

Honestly, if someone can actually use them (as demonstrated by paying the price+shipping) then they would probably have a better home with that person.

dark-star 61 days ago

Something doesn't add up here. As someone who has only recently built a home-server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32g DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Everything else doesn't work

mwpmaybe 61 days ago

There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.

robeastham 61 days ago

This is not true. A few well known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)

dark-star 60 days ago

crazy, I really did not know that. Do you happen to know if such boards also exist that take registered DDR3 RAM? None of them explicitly call out DDR3-R RAM so I assume they only take consumer RAM?

smartbit 61 days ago

enlighten us

bobmcnamara 60 days ago

https://www.aliexpress.com/s/wiki-ssr/article/2696-v4-ddr3

happycube 61 days ago

It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzhen workstation/gaming board using recycled parts... haven't searched on that but it's bound to exist.

TacticalCoder 61 days ago

> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.

Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).

justinclift 61 days ago

Yeah, the Intel reference page only lists DDR4, not DDR3:

https://www.intel.com/content/www/us/en/products/sku/92986/i...

Lerc 61 days ago

This seems remarkably suited to my situation,

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Do 8 DIMM sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

miahi 61 days ago

One thing to note: These Xeons have quad memory channels, that usually means double the bandwidth of an equivalent desktop CPU, if you populate all the slots.

I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice; the memory bandwidth is higher than that of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.

arpinum 61 days ago

How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.

vetrom 61 days ago

IDK about OPs setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self hosted business SaaS usage.

One node's ipmitool sensor report (from a self-monitoring PSU, so grain of salt, but my UPS-side monitoring tracks closely) shows 250-300W average power use. Mind you, that is for running 22 spinning disks, 2 SAS/SATA SSDs, 4 NVMe SSDs, and 768GB of DDR4.

Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.

As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, an RDNA4 GPU, and a small pile of NVMe drives, has a power floor of roughly 250W. On some CPU-centric workloads you'll definitely lose out on the older gens of machines, but they are by no means impractical.

Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.

---

Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.

RetroTechie 61 days ago

How many kWh to fabricate a brand new machine better suited to the task?

As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.

Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.

dangus 61 days ago

I don’t know why you’d assume that an older system is lower footprint.

If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.

And that is just electricity; it assumes your hardware never fails and you never incur any additional costs.
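The electricity figure works out as follows. This is a minimal sketch using the 100 watts and 20 cents per kWh from the comment; the 30-day month is an assumption, and the 275 W line simply takes the midpoint of the 250-300W range reported elsewhere in this thread.

```python
def monthly_energy_cost(avg_watts: float, price_per_kwh: float, days: int = 30) -> float:
    """Cost of running a constant average load for a month (assumed 30 days)."""
    kwh = avg_watts / 1000.0 * 24 * days
    return kwh * price_per_kwh


cost_100w = monthly_energy_cost(avg_watts=100, price_per_kwh=0.20)
cost_275w = monthly_energy_cost(avg_watts=275, price_per_kwh=0.20)
print(f"100 W average: {cost_100w:.2f} per month")
print(f"275 W average: {cost_275w:.2f} per month")
```

At 100 W that is 72 kWh, or 14.40 per month at the quoted rate.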

There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.

Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.

quietsegfault 61 days ago

I have two LARGE Xeon systems of this era that I used to use when I was heavily involved with Kubernetes and needed to build out a home lab. One is 2x Xeon w/ 256 GB of ram, and one is 1x Xeon w/ 512GB of ram. Both are slow as dogs, and both of them take up at least 150+ watts with only one power supply. My 12th gen Intel Nuc is so, so much faster and efficient. I'm recycling the Xeon systems.

gnerd00 61 days ago

Xeon is a group of products with really varying specs. There is no indication of which Xeons. Also new consumer CPUs often have really small internal caches.

dangus 60 days ago

The Xeon processor in use by the OP of this article claims to have 20MB of Intel “Smart Cache.”

An Apple M4 chip in a Mac mini has 16MB on the P-cores and 4MB on the E-cores.

Depending on use case, AMD 3D V-cache at almost 100MB could also work out quite well.

So really, if you wait long enough, consumer chips end up with a pretty similar amount of cache.

quietsegfault 60 days ago

E5-2690s in my case.

ThatMedicIsASpy 60 days ago

The reason more performance/watt is in demand is that a datacenter can't suddenly draw twice as much power.

dangus 60 days ago

Or because I don’t want my homelab to spike my electricity bill and give me a loud hot closet.

souterrain 61 days ago

You mention lower footprint but then make a cost comparison against Claude subscription pricing.

Claude subscription pricing is a broken way to consider footprint.

dangus 60 days ago

You can call it whatever you want, money is money, and money spent on energy is footprint.

shevy-java 61 days ago

Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.