by tarruda 781 days ago
Note that two RTX 3060s will probably be significantly slower than a single RTX 4090.

Even with an RTX 4090, 2 tokens per second is very slow and likely not ideal for most tasks. It is impressive (much faster than previous solutions), but still very slow for real-time use.

If you want to run Llama 3 70b, it might be better to purchase a Mac Studio with 64GB RAM (more for longer contexts) and run it with 4-bit quantization.

My 2 cents: For most common tasks Llama 3 8b will be more than enough, and you can run that at full (16-bit) precision on a single RTX 3090. At a much lower cost, you can also run Llama 3 8b with 8-bit quantization on a single RTX 3060, if it is the 12GB version.
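
A rough back-of-the-envelope sketch of why those pairings work: memory for the weights alone is roughly parameter count times bits per weight, divided by 8. The function name `weight_memory_gb` is made up for illustration, and the nominal parameter counts (70e9 and 8e9) are assumptions. The real footprint is higher, since this ignores the KV cache (which grows with context length), activations, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a lower bound rather than a real requirement.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumed round numbers, not exact figures).
CONFIGS = [
    ("Llama 3 70b, 4-bit", 70e9, 4),
    ("Llama 3 8b, 16-bit", 8e9, 16),
    ("Llama 3 8b, 8-bit", 8e9, 8),
]


def estimates() -> dict:
    """Return the weight-only estimate for each configuration above."""
    return {name: weight_memory_gb(n, bits) for name, n, bits in CONFIGS}


if __name__ == "__main__":
    for name, gb in estimates().items():
        print(f"{name}: ~{gb:.0f} GB of weights")
```

By this estimate the 70b model at 4-bit needs about 35 GB for weights, which is why 64GB leaves room for context; the 8b model needs about 16 GB at 16-bit (fits in a 24GB RTX 3090) and about 8 GB at 8-bit (fits in a 12GB RTX 3060).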