Hacker News new | ask | show | jobs
by eurekin 1064 days ago
Yes, it runs totally fine. I ran it in Oobabooga/text generation web ui. Nice thing about it is that it autodownloads all necessary gpu binaries on it's own and creates a isolated conda env. I asked same questions on the official 70b demo and got same answers. I even got better answers with ooba, since the demo cuts text early

Ooobabooga: https://github.com/oobabooga/text-generation-webui

Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

ExLlama_HF loader gpu split 20,22, context size 2048

on the Chat Settings tab, choose Instruction template tab and pick Llama-v2 from the instruction template dropdown

Demo: https://huggingface.co/blog/llama2#demo

1 comments

Is there any specific settings to make 2x3090 work together?
Not really? I just got those cards in separate PCI slots and the Exllama_hf handles spreading the load internally. No NVLink bridge in particular. I use the "20,22" memory split so that the display card has some room for the framebuffer to handle display
Do you mean you don't use NVLink or just use one that works? I am under the impression it is being phased out ("PCIe 5 is fast enough") and some kits don't use it.
I don't use NVLink
Interested in this too