by spockz 1 day ago

For what it is worth, I’m on a similar machine (9070XT, 5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with --no-mmap and --perf. The context is still quite small though. With online models I use contexts of at least 200k, which is useful for longer-running or more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.
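
To make the 8k versus 200k context gap above concrete, here is a rough sketch of how KV-cache memory grows with context length. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions picked for illustration, not the settings of any model named in this thread.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence, in bytes."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (8_192, 200_000):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB of KV cache")
```

Under these assumed numbers the cache for 200k tokens alone is larger than 16 GiB, before counting the model weights. A quantized KV cache or a model with fewer KV heads would shrink it proportionally.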

echelon 1 day ago

I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between open and frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

zozbot234 1 day ago

H200s and other enterprise datacenter GPUs are completely overkill in any realistic single- or few-user inference scenario. They're hugely unbalanced towards compute capacity, which will go almost entirely unused (i.e. wasted) unless you're running huge batches on a continuous basis. I've argued many times that local inference engines should support batched inference on a somewhat smaller scale for a variety of reasons (especially given the unexpected effectiveness of SSD-streamed inference with larger-than-RAM models), but even I don't think we can realistically go to 300x or so for real-time inference, which is the range that pencils out quite consistently from a simple roofline model of these datacenter cards.
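
The roofline argument above can be sketched in a few lines. The hardware figures are assumptions based on approximate published H200 numbers (about 989 TFLOPS dense BF16 and about 4.8 TB/s of memory bandwidth), and the model ignores KV-cache reads, attention cost, and kernel overheads, so treat the result as an order-of-magnitude estimate only.

```python
def break_even_batch(peak_flops, mem_bandwidth, bytes_per_param):
    """Batch size at which weight-bound decoding becomes compute-bound."""
    # Each decoded token costs roughly 2 FLOPs per parameter per sequence,
    # while the weights are read from memory once per step for the whole batch.
    flops_per_byte_per_seq = 2.0 / bytes_per_param
    # Ridge point of the roofline: FLOPs the chip can do per byte it can load.
    ridge = peak_flops / mem_bandwidth
    return ridge / flops_per_byte_per_seq


# Assumed, approximate H200 figures (not taken from the thread).
PEAK_FLOPS = 989e12      # dense BF16, FLOP/s
MEM_BANDWIDTH = 4.8e12   # bytes/s

batch = break_even_batch(PEAK_FLOPS, MEM_BANDWIDTH, bytes_per_param=2)
print(f"break-even batch size: about {batch:.0f} concurrent sequences")
```

With these inputs the break-even lands in the low hundreds of concurrent sequences, the same order of magnitude as the figure quoted above. Below that batch size the card spends most of its time waiting on memory rather than computing.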

echelon 1 day ago

If you're doing professional work in coding or video, you can easily saturate a single H200.

This is what RunPod-type services are for.

For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.

I can rent an H200 for $3.50 an hour. That's INSANELY cheap.

I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.

The ideal solution is models we own run on RunPods leveraging H200s.

I can spend $100-200/day on compute making much more value with the model outputs.

----

edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.

You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.

spockz 1 day ago

Sure, to approach frontier model quality locally we need to have more power. And H200s are a way to get there.

However, we need to use the tools that we have. Even if I wanted to buy a (bunch of) H200s for me and my colleagues and could get the expense approved, they are hard to source where we are.

Yes. You can rent them, but I’m not sure how that affects the IP discussion.

Moreover, not everyone is doing coding and video, so different tasks fit different hardware. Some fit quite well on relatively light laptops (Gemma et al.). For relatively directed coding sessions we can make do with RTX cards, or a small step up, all the way to an H200 in the workstation. Or pods thereof.

We have the graphics cards and laptops with MLX right now. The H200 will take a year at least to arrive. Better get used to running stuff locally.

zozbot234 1 day ago

I'll definitely believe that for video generation models, but those are also very compute-intensive for rather middling results.

what 19 hours ago

> I’m a contrarian that says things that rile up the anti-AI folks

That’s hardly contrarian here, lol.

echelon 19 hours ago

Are we experiencing the same website?

I swear, two thirds of the folks here just make comments that dunk on AI. They underestimate it, hate it, hate those that use it, etc. It's the "old angry man yells at cloud" trope.

I've had so many consecutive days of "-4" karma posts that HN is blocking me from commenting. And the comment retorts I get from these folks are absolute gems that will undoubtedly age like milk.

SR2Z 1 day ago

That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.
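
For a rough sense of how the purchase price compares with the rental rate quoted earlier in the thread, here is a back-of-the-envelope sketch. It uses only the two figures mentioned in the discussion ($25k to buy, $3.50 per hour to rent) and assumes away power, hosting, resale value, and idle time, so it is a simplification rather than a real cost model.

```python
def break_even_hours(purchase_price, hourly_rate):
    """Rental hours whose cost equals the purchase price."""
    return purchase_price / hourly_rate


def gpus_for_daily_budget(daily_budget, hourly_rate):
    """How many GPUs a daily budget rents around the clock."""
    return daily_budget / (hourly_rate * 24)


PURCHASE_PRICE = 25_000  # USD, figure from the thread
HOURLY_RATE = 3.50       # USD per hour, figure from the thread

hours = break_even_hours(PURCHASE_PRICE, HOURLY_RATE)
print(f"break-even after {hours:.0f} rental hours ({hours / 24:.0f} days nonstop)")

for budget in (100, 200):
    n = gpus_for_daily_budget(budget, HOURLY_RATE)
    print(f"${budget}/day rents about {n:.1f} GPUs around the clock")
```

Under these simplified assumptions, renting only overtakes the purchase price after most of a year of continuous use, so the comparison depends heavily on how many hours per day the card would actually be busy.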

dofm 1 day ago

Pressure on small model quality and design is absolutely what is needed. There are still gains to be made.

MrLeap 1 day ago

There are a lot more professionals that have RTX cards than H200s. You'll inevitably see more development and experimentation on things actual humans have lmao.

FridgeSeal 19 hours ago

Ah yes, because of all the people at home with computers who have…checks notes…datacentre GPUs lying around.
