With 1TB of RAM you can run nearly anything available (405B essentially being the largest ATM). Llama 405B in FP8 precision fits in H100x8 which is 640GB VRAM. Quantization is a very deep and involved well (far too much for an HN comment).
I'm aware it "works" but I don't bother with CPU, GGUF, even llama.cpp so I can't really speak to it. They're just not even remotely usable for my applications.
> tokens per second
Sloooowwww. With 405B it could very well be seconds per token but this is where a lot of system factors come in. You can find benchmarks out there but you'll see stuff like a very high spec AMD EPYC bare metal system with very fast DDR4/5, tons of memory channels, etc doing low single-digit tokens per second with 70B.
> ill get a GPU too but not sure how much VRAM I need for the best value for your buck
Most of my experience is top-end GPU so I can't really speak to this. You may want to pop in at https://www.reddit.com/r/LocalLLaMA/ - there is much more expertise there for this range of hardware (CPU and/or more VRAM limited GPU configs).
There are a variety of other LLM inference implementations that can run on CPU as well.
[0] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...
[1] - https://docs.vllm.ai/en/v0.6.1/getting_started/cpu-installat...