by kir-gadjello
744 days ago
While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised me and others[1] with how well it performs even in the 2-3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (with an uncompressed KV cache, btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd

For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.

[1] https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_...
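For anyone who wants to sanity-check the "fits on one 4090" claim, here is a rough back-of-the-envelope sketch. The numbers are my assumptions, not measurements: about 70.6B parameters, about 2.31 bits per weight for IQ2_XS, and the published llama3-70b shape (80 layers, 8 KV heads via GQA, head dimension 128) with an fp16 KV cache. Real GGUF files keep some tensors at higher precision and the runtime needs extra compute buffers, so treat the result as a lower bound.

```python
GIB = 1024 ** 3

def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Assumed values (not taken from the comment above).
N_PARAMS = 70.6e9   # approximate llama3-70b parameter count
BPW = 2.31          # nominal bits per weight for IQ2_XS
VRAM_GIB = 24       # RTX 4090

weights = weights_bytes(N_PARAMS, BPW)
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=8192)
total_gib = (weights + kv) / GIB

print(f"weights:  {weights / GIB:.2f} GiB")
print(f"kv cache: {kv / GIB:.2f} GiB")
print(f"total:    {total_gib:.2f} GiB of {VRAM_GIB} GiB")
```

Under these assumptions the weights come to roughly 19 GiB and the 8K fp16 KV cache to 2.5 GiB, which leaves only a couple of GiB of headroom on a 24 GiB card. That is why the bits-per-weight budget has to be this low.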