Hacker News new | ask | show | jobs
by int_19h 1088 days ago
llama-30B (which is actually 33B) and derivatives generally run fine with 4-bit quantization on a single RTX 3090 or 4090, although depending on group size used for quantization you may need to slightly dial down the context size.