by tarruda 876 days ago
> You would need multiple GPUs with shared memory if you wanted to offload the higher precision models to VRAM.

Or just a powerful Apple silicon machine? I've tried Dolphin Mixtral 4-bit on an M3 MacBook with 36 GB of RAM, and inference is super fast.

3 comments

Or a Linux machine with a Ryzen using the internal GPU and the unified RAM (scroll down at llama.cpp and look for ROCm).
Wait, ROCm supports Ryzen APUs and still doesn't support dedicated GPUs like the 6700XT?!
While it isn't officially supported, ROCm runs just fine on my 6700XT; I just have to set an env var (export HSA_OVERRIDE_GFX_VERSION=10.3.0).
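A minimal sketch of applying that override from Python when launching a ROCm-backed program. The helper name `rocm_env` and the placeholder command are hypothetical; only the variable name and the value 10.3.0 come from the comment above.

```python
import os


def rocm_env(gfx_version="10.3.0", base_env=None):
    """Return a copy of the environment with the ROCm GFX override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    return env


if __name__ == "__main__":
    env = rocm_env()
    print(env["HSA_OVERRIDE_GFX_VERSION"])
    # To launch a program with the override, pass the dict along, e.g.:
    # subprocess.run(["./your-rocm-program"], env=env, check=True)
```

Copying the environment instead of mutating `os.environ` keeps the override scoped to the child process.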
Really? Does everything run? Even AI stuff? Do you have any links where I can read more about that?
Everything I've tried to get running worked quite smoothly, although I've only tried LLMs via llama.cpp and Stable Diffusion via ComfyUI. I don't see any reason why other AI stuff wouldn't work, as long as it supports ROCm.

Also, I've only tried it on Linux. AFAIK Windows is a lot more difficult to get running, if it works at all...

With llama.cpp, I successfully tried various LLMs (e.g. LLaMA 13B, Mixtral, etc.) with very solid performance. Even for models that don't fit in VRAM completely, performance can be surprisingly solid, as long as you compile with AVX extensions (and your CPU supports them).
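One way to check the CPU side of that AVX caveat on Linux is to look at the feature flags in /proc/cpuinfo. This is a rough sketch: the function names are made up for illustration, and the flag names checked (`avx`, `avx2`, `avx512f`) are the ones Linux reports on x86.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def avx_support(cpuinfo_text):
    """Report which AVX variants appear among the flags."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("avx", "avx2", "avx512f")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:  # Linux only
            print(avx_support(f.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this system")
```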

Stable Diffusion via ComfyUI also works very well. However, be aware of VRAM limitations with the larger SDXL variants, especially when running a heavy desktop environment.

Regarding setup guides/links, there isn't a good centralized resource sadly, so some tinkering is needed. Unlike some of those CUDA 1-click solutions, ROCm requires more manual setup, especially for GPU models that are only unofficially supported.

Here are a couple of links that might be helpful:

https://old.reddit.com/r/LocalLLaMA/comments/18ourt4/my_setu...

https://old.reddit.com/r/StableDiffusion/comments/ww436j/how...

https://rentry.org/eq3hg

In general, the r/LocalLLaMA and r/StableDiffusion subreddits are good places to search for info.

Or a Jetson AGX Orin (~$2k). Probably the cheapest way to get an Nvidia GPU with 64 GB of RAM.
I wonder what would be the cheapest way to run an LLM: the latest Ryzen integrated graphics with 64 GB of RAM, or the Jetson AGX Orin 64GB. https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...
The Ryzen is a lot cheaper, but most likely also a fair bit slower. You'd be looking at a $200 CPU, a $200 motherboard, and $200 of DDR5 RAM. Throw in a case, NVMe drive, and power supply and you're still below $1k, and those numbers are quite generous estimates. You could do it a lot cheaper by going AM4 with DDR4 RAM.
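As a quick check of that estimate, the three quoted prices leave the following headroom under $1k for the case, NVMe drive, and power supply. No prices for those remaining parts are given in the comment, so none are assumed here.

```python
# Prices quoted in the comment above (USD).
core_parts = {"cpu": 200, "motherboard": 200, "ddr5_ram": 200}
budget = 1000

core_total = sum(core_parts.values())
remaining = budget - core_total  # left for case, NVMe drive and power supply

print(f"core parts: ${core_total}, remaining under ${budget}: ${remaining}")
```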
Have you tried this yourself? Curious to know how well this works for an LLM home lab.
I’ve worked with Jetson going back to the TK1 and I highly recommend you do not do this.

Nvidia has significant dominance in the AI space because of their work on software and the overall platform.

With the Jetson line being the sole exception. Use it for what it’s for - a targeted build for an embedded/specific application requiring small size and low power.

The software is a mess. Support for Jetson is generally a distant afterthought, or not considered at all, in projects at Nvidia and in the broader ecosystem. When it is supported at all it lags behind significantly, using ancient distros (Jetpack), etc. To make matters worse, the user base is so (relatively) tiny that there are bugs and strange behavior everywhere.

Just don’t do it.

This is a bit surprising to hear. Current Jetpack 6 is Ubuntu 22.04 - this is the current Ubuntu LTS release. There's nothing ancient about it, no? I'm pretty sure, if I go and check versions of CUDA, PyTorch, Tensorflow - it'd be also relatively recent.

I'd suggest checking what examples are available, see what community is doing, see if what you need had already been tried - https://www.jetson-ai-lab.com

From what I've seen, mainstream LLM libraries like vLLM and llama.cpp that use CUDA under the hood tend to work out of the box. And there are tutorials available: https://www.jetson-ai-lab.com/tutorial_text-generation.html. I think that TensorFlow/PyTorch are also well maintained, although I've not checked recently.

I think this perspective comes from a lack of historical experience and hands-on experience overall.

Nvidia more broadly has very impressive support for their GPUs. If you look at the support lifecycles for their Jetson hardware over time, it's significantly worse. I encourage you to look at what support lifecycles have looked like, with the most "egregious" example being the dropping of support for the Jetson Nano within, from what I recall, a couple of years.

Another consideration - Jetson is optimized for power efficiency/form-factor and on a per $ basis CUDA performance is terrible. The power efficiency and form-factor come at significant cost. See this discussion from one of my projects[0]. I evaluated the use of WIS on an Orin Nano that I have and it was nearly 10x slower than a GTX 1070 which is seven years old and is still supported by the latest drivers and CUDA 12 on whatever OS you want.

Nvidia knows what they're doing in terms of productization, and the Jetson line should not be seen as some kind of secret hack/unlock for getting CUDA performance with gobs of RAM. In the case of LLMs I wouldn't be surprised at all if CPU beats it, and at that point you could pick up 256GB of RAM or whatever for equivalent cost.

In the end what do I care what people use, I'm offering the perspective and experience of someone who has actually used the Jetson line for many years and frequently struggled with all of these issues and more.

[0] - https://github.com/toverainc/willow-inference-server/discuss...

I have a Jetson as well, and you are sorely mistaken. Just reading the doc pages, everything seems nice and well, but Nvidia deprecates these little boards like no other. No support after you've bought the thing, and everything is kept frozen (i.e. no new Python, no new Python dependencies, etc.). What they aren't telling you is that specific sub-versions within each Jetson/Orin family board have differing support (i.e. not what they say on that website you are reading), and it's up to you to figure it out.

I've gotten my Jetson to work well using Yocto to build my own linux distros with correct updated dependencies, libraries and updated jetpack, but it's not for the faint of heart, and that's a whole other ball of yarn. It also takes a few hours to generate a new build every time I need to update some dependency that depends on other dependencies (Yocto maintenance is a full time job in many embedded development shops - you're basically authoring your own distribution).

Treat these devices as what they are: embedded target boards for fixed industrial development (for example, to go into a robot or a car - once that design is finished, Nvidia will expect you to NEVER update any part of the system with an embedded jetson or orin system for years, until you replace the whole thing with their newest model that you buy off the shelf again).

This is standard fare in embedded and robotics space. Do not use these boards for any kind of rapidly moving software development, because it's the wrong tool for the job.

Yes, it's all rather recent in my experience. You get CUDA 12 and the newest PyTorch.
According to this article [1], it looks like no complex preparation is needed to run inference on a Jetson system. Should work with Mixtral too.

[1] https://www.hackster.io/pjdecarlo/llama-2-llms-w-nvidia-jets...

I haven't tried it for LLMs yet, I use it for real-time RF processing, but I actually have one of them on my desk and they are fun little devices.

Maybe I will try to get a 32 GB+ LLM running one of these days.

What? I can do this? Runs to the PC

EDIT: I cannot, I need to install ROCm to compile with it, and then install something called hipBLAS, and who knows what else.

Well, yes, you need to install ROCm and its dependencies. Have a look at https://rocm.docs.amd.com/projects/install-on-linux/en/lates... Debian trixie (not yet released) has most dependencies as packages. Or you can try a Docker container: https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
I'll try that, thanks!
OpenCL should also work on AMD cards, and is way easier to install
It is dead slow on integrated graphics, unfortunately.
Does that let me use unified memory on the GPU, though? Or is it just so I can use my CPU memory?

EDIT: Oh, no, I have an Nvidia GPU and an AMD CPU.

I bet your AMD CPU has an internal GPU, too. That's what you can use with the unified memory.
How much RAM are you able to set aside for a Ryzen iGPU?
I think my motherboard allows me to dedicate 12 GB. I didn't see any real improvement using CPU + ROCm compared to CPU alone. Using CPU alone I get 4.2-5 tokens/s; with ROCm I get 4.5-5.2 tokens/s. With CPU + RTX 2070 8GB I get 6.2-7 tokens/s.
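To put those ranges side by side, here is a small sketch that compares the midpoint of each range against the CPU-only midpoint. Using the midpoint is an assumption made for illustration; the comment only gives the low and high ends.

```python
# Token rates (tokens/s) as (low, high) ranges, taken from the comment above.
rates = {
    "CPU only": (4.2, 5.0),
    "CPU + ROCm iGPU": (4.5, 5.2),
    "CPU + RTX 2070 8GB": (6.2, 7.0),
}


def midpoint(low_high):
    low, high = low_high
    return (low + high) / 2


def speedups(rates, baseline="CPU only"):
    """Midpoint of each range relative to the baseline's midpoint."""
    base = midpoint(rates[baseline])
    return {name: midpoint(r) / base for name, r in rates.items()}


for name, factor in speedups(rates).items():
    print(f"{name}: {factor:.2f}x")
```

By this measure the iGPU adds about 5% over CPU alone, while the partially offloaded RTX 2070 adds a bit over 40%.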
How fast is it with a setup like this?
I can run 4-bit on a beat-up 1070 Ti. GP is talking about higher-precision models.
You wouldn’t be able to fit the whole model into 8GB VRAM. It’s faster than not using a GPU at all, but most of it would still be computed on the CPU.
IME ollama ran mixtral on a 1070 fast enough.
Though it most probably does not run on the 1070 but rather on the CPU. It cannot fit on a 1070; it is not about speed, a 1070 cannot run it, period.
In llama.cpp you can offload some of the layers to the GPU with -ngl X, where X is the number of layers.
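A rough sketch of how one might pick X. It assumes all layers are the same size and reserves some VRAM for the context cache and scratch buffers. The figures used (a roughly 26 GB 4-bit Mixtral file, 32 layers, 1.5 GB reserved) and the file name are illustrative assumptions, not numbers from this thread. Only `-m` and `-ngl` are actual llama.cpp options.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


def llama_cpp_args(model_path, n_gpu_layers):
    """Arguments for a llama.cpp binary; -ngl sets the offloaded layer count."""
    return ["-m", model_path, "-ngl", str(n_gpu_layers)]


if __name__ == "__main__":
    # Assumed figures: ~26 GB 4-bit Mixtral file, 32 layers, 8 GB card.
    n = layers_that_fit(model_gb=26.0, n_layers=32, vram_gb=8.0)
    print(n, llama_cpp_args("mixtral-q4.gguf", n))
```

With these assumptions only about a quarter of the layers land on an 8 GB card, which matches the point above that most of the work still happens on the CPU.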
Did you do anything special to make that work? Is it useful? Or just a toy?
I have a 14" MBP with an M1 Max and 64GB. The M3 won't really make a difference, but the RAM, since unified, is huge. I can run most models on this machine with realtime performance compared to a Ryzen 7735HS and 64GB (DDR5). Now I'm not saying the Ryzen setup should be good, but the M1 architecture just makes it a much better option. I could add an eGPU to the Ryzen system and it could likely do better, but would also exceed the price point and portability.
it's not just that it's huge and unified - ryzen APUs obviously can have 2x32GB SODIMMs put in them and they support unified memory too.

the difference is the bandwidth and the computational power of the APU. M1 Max is roughly similar to a PS5 in terms of overall system design (shader configuration and bandwidth) plus has dedicated AI inference units already (which won't be added to consoles until PS5 Pro launches with RDNA 3.5). It is far more bandwidth than you can get out of a socketed-memory laptop system.

https://twitter.com/Locuza_/status/1450271726827413508
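The bandwidth gap can be estimated from bus width and transfer rate. The sketch below assumes a 512-bit LPDDR5-6400 interface for the M1 Max and dual-channel (128-bit) DDR5-4800 for a socketed laptop. Those configurations, and the 20 GB of weights read per token, are assumptions chosen for illustration rather than figures from this thread. The tokens/s number is only a ceiling for bandwidth-bound generation.

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak memory bandwidth in GB/s (bytes per transfer x MT/s)."""
    return bus_width_bits / 8 * mega_transfers_per_sec / 1000


def max_tokens_per_sec(bandwidth_gbs, gb_read_per_token):
    """Upper bound if every generated token streams the given data once."""
    return bandwidth_gbs / gb_read_per_token


# Assumed configurations, not taken from the thread.
m1_max = peak_bandwidth_gbs(512, 6400)       # 512-bit LPDDR5-6400
ddr5_laptop = peak_bandwidth_gbs(128, 4800)  # dual-channel DDR5-4800

weights_gb = 20.0  # hypothetical amount of weights read per token
for name, bw in (("M1 Max", m1_max), ("DDR5 laptop", ddr5_laptop)):
    ceiling = max_tokens_per_sec(bw, weights_gb)
    print(f"{name}: {bw:.1f} GB/s, at most {ceiling:.1f} tokens/s")
```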

To support that level of performance in a socketed-memory system you will need an extra layer of caching added to the processor to supplement the bandwidth - and maybe still need to go to quad-channel. Those products are Strix and Strix Halo and should be hitting the market over the next year or two but the reality is that the M1 Max was an absurdly powerful laptop, far more potent than even the first-gen 5nm laptops for x86 let alone the other junk you could buy in 2020.

This is the problem with the discourse around apple silicon for the last few years: yeah, they're expensive, but even a loaded-out x86 laptop doesn't get you the same capabilities. Even if the x86 is competitive in some particular benchmark on iso-node you are probably spending more power to do it, and the x86 product comes years after the apple product, and still has a much weaker gpu and less bandwidth (which doesn't just matter for GPU, it matters for compiling and JIT too).

It is incredibly silly to look back on the discourse in 2020-2023 around apple silicon, a lot of reviewers made extremely silly claims about how "even 7nm x86 processors were already competitive with apple silicon" and as the ecosystems have matured it is obvious that even 5nm processors are not quite competitive yet. And they dumped on the SPEC tests and Geekbench that measured this properly, in favor of dumb things like cinebench R23 and so on (it's always cinebench used for this dumb shit tbh, CB R13/R15 were hugely misleading at the zen1 launch too). Let alone things like, you know, compiling or JVM/node workloads...

(similarly, gotta love the vibe a few years ago of: "threadripper vs mac pro" - did you know that a 64C threadripper with 256GB RAM is actually cheaper than a mac pro loaded out with 2TB!? waow, who knew systems with an order of magnitude less capacity would be cheaper!? https://youtu.be/BH291DQRIOg )

I've had less luck with Mixtral, but I run Yi 34B finetunes for general personal use, including quick queries for work.

It's kinda like GPT 3.5, with no internet access and slightly less reliable responses, but unrestrained, much faster and with a huge (up to 75K on my Nvidia 3090) usable context.

Mixtral is extremely fast though, at least at a batch size of 1.

Which Yi 34B finetunes are you using that have a 75,000 token length?
All of the Yi 200K finetunes should support it, but you have to be careful because some degrade the base model's quite excellent long context performance more than others. The very strong Bagel 34B DPO model, for instance, basically doesn't work at long context.

Nous Capybara is a popular one. I personally use my own merge of many models, and you can look through the constituent models to see if any interest you: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megame...

You can't really use llama.cpp for super long context btw, it's just too slow and VRAM-inefficient at the moment.
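For a sense of why long context is so VRAM-hungry, here is a back-of-the-envelope sketch of the key/value cache size. The model shape (60 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption about Yi-34B-style models and is not stated in this thread. Real backends add their own overhead on top.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the key/value cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed Yi-34B-style shape: 60 layers, 8 KV heads, head dimension 128.
fp16 = kv_cache_bytes(75_000, 60, 8, 128, bytes_per_value=2)
int8 = kv_cache_bytes(75_000, 60, 8, 128, bytes_per_value=1)

print(f"fp16 cache: {fp16 / 1e9:.1f} GB, 8-bit cache: {int8 / 1e9:.1f} GB")
```

Under these assumptions a 75K-token cache alone takes a large share of a 24 GB card before any weights are loaded, so cache precision matters a lot at that length.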

Nothing special other than llama.cpp, which is an inference engine optimized for Apple silicon.

I heard you can simply install the ollama app, which uses llama.cpp under the hood but has a more user-friendly experience.

I've been using it for 'easy' queries like syntax/parameter questions, in place of ChatGPT 4. It's great for that. I am using a ~48GB version.