

	by ttt3ts 1006 days ago
	You can run 70B LLAMA on dual 4090s/3090s with quantization. Going with dual 3090s, you can get a system that can run LLAMA 2 70B with 12K context for < $2K. I built two such systems after burning that much in a week on ChatGPT.
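A rough back-of-envelope check of why this fits in two 24 GB cards, as a sketch rather than a measurement. The layer count, KV-head count, and head dimension are the published Llama 2 70B shape; the 4-bit weight format, the fp16 KV cache, and the nominal 70e9 parameter count are assumptions, since the post only says "with quantization". Activations, quantization metadata, and framework overhead are ignored.

```python
# Published Llama 2 70B shape: 80 layers, grouped-query attention with 8 KV heads, head dim 128.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
N_PARAMS = 70e9  # nominal parameter count (assumption: rounded)

GIB = 1024 ** 3


def weight_bytes(bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return N_PARAMS * bits_per_weight / 8


def kv_cache_bytes(context_tokens, bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every layer and KV head."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value * context_tokens


context = 12 * 1024               # "12K context" from the post
vram = 2 * 24 * GIB               # two 3090s
total = weight_bytes(4) + kv_cache_bytes(context)  # assumption: 4-bit weights

print(f"weights : {weight_bytes(4) / GIB:.1f} GiB")
print(f"kv cache: {kv_cache_bytes(context) / GIB:.2f} GiB")
print(f"total   : {total / GIB:.1f} GiB of {vram / GIB:.0f} GiB")
```

Under these assumptions the weights dominate and the 12K cache adds only a few GiB, which leaves headroom for overhead on a 48 GiB pair.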

5 comments

coryrc 1006 days ago

> I built two such systems after burning that much in a week on ChatGPT.

What are you doing!?


ttt3ts 1005 days ago

Have a client with many thousands of csv, json, xml files detailing insurance prices. Fundamentally they all contained the same data, but in wildly different formats because they were produced by different companies and teams. I used ChatGPT to deduce their format so I could normalize them. Easily underbid their current contractor, who was using humans for the work, and now I have an easy quarterly billing. :)

TBC, I probably could have optimized tokens but contract was profitable and time critical.
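A minimal sketch of what the normalization step could look like once a field mapping has been deduced for a given file. This is not the poster's code: the target schema (`plan`, `price`), the source field names, and the assumed file layouts are all hypothetical, and the mapping is simply passed in by hand where an LLM's answer would go.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_records(text, fmt):
    """Parse raw file text into a list of flat dicts, one per record."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        # Assumption: the file is a top-level list of flat objects.
        return json.loads(text)
    if fmt == "xml":
        # Assumption: one child element per record, one grandchild per field.
        return [{field.tag: field.text for field in rec} for rec in ET.fromstring(text)]
    raise ValueError(f"unsupported format: {fmt}")


def normalize(text, fmt, mapping):
    """Rename source fields to the (hypothetical) target schema."""
    return [
        {"plan": rec[mapping["plan"]], "price": float(rec[mapping["price"]])}
        for rec in parse_records(text, fmt)
    ]


# The mapping is what the format-deduction step would produce for each file.
csv_text = "PlanName,Cost\nBasic,10.5\n"
print(normalize(csv_text, "csv", {"plan": "PlanName", "price": "Cost"}))
```

Keeping the mapping as data means the model only has to be asked once per distinct format, not once per file.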

Thanks for sharing!

Would you mind sharing all your PC HW (mobo, case, cooling, etc.) for this dual GPU configuration? Thanks.


ttt3ts 1006 days ago

The one you could build for under 2K is last gen hardware.

* Chenbro Rackmount 4U Server Chassis RM42300-F (rack mount case). Remove the air filter on the 120mm fan. Put two decent 80mm exhaust fans at the rear.
* Two used air-cooled 3090s. About $650 apiece on eBay. Check slot width and make sure everything will fit on your motherboard. Do a burn-in when you get them, because used GPUs can be hit or miss.
* 5950X CPU (overkill, I just had it)
* 128GB DDR4
* Motherboard with X570 chipset and dual PCIe x16 slots. These will bifurcate to x8 PCIe 4.0 lanes to each GPU. This is enough bandwidth to push the GPUs to max, IME.
* 1200W+ ATX power supply
* Search eBay for "u.2 pcie 3.84TB", plus an adapter for the M.2 NVMe slot (again, what I had, and it is cheap)

If you're going to really beat on the thing, I would power limit the 3090s to 320W (from 350W). The perf change is not really noticeable and it keeps temps lower.
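One way to apply such a limit is `nvidia-smi`'s `-pl` (power limit, in watts) option with `-i` to select the GPU. The sketch below only prints the commands by default; the GPU indices 0 and 1 are an assumption about how the two cards enumerate, and actually applying the limit typically needs root and does not persist across reboots.

```python
import subprocess


def power_limit_commands(gpu_indices, watts):
    """Build one nvidia-smi invocation per GPU: -i selects the card, -pl sets the limit."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in gpu_indices]


def apply_power_limit(gpu_indices, watts, dry_run=True):
    """Print the commands, or run them when dry_run is False (usually requires root)."""
    for cmd in power_limit_commands(gpu_indices, watts):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Assumption: the two 3090s are GPUs 0 and 1.
apply_power_limit([0, 1], 320)
```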


efreak 1004 days ago

From people hosting image generation models on Stable Horde I've heard that you can pretty severely underclock/undervolt your GPUs and keep them stable, massively reducing heat output and energy cost without losing nearly as much performance. I'm not sure if this transfers into text generation or not, this was from image generation workers that have a few seconds downtime between requests; however it might be worth a bit of research if you happen to be running consumer GPUs.

-----

From TheUnamusedFox, in August:

> 3090 down to ~260-270 watts (from 400) with minimal gen speed impact. Same with a 3080ti. It seems to be more stable with image generation than gaming, at least on my two cards. If I try to game or benchmark with this undervolt it is an instant crash.

From another user:

> this undervolting stuff is pretty sweet.
> undervolted_limits.png [1]
> max_power_limits.png [2]
> this is my before and after.
> a solid 200 watt drop for only 9.2% loss of performance
> not to mention the 30 degree drop in temps

[1]: https://cdn.discordapp.com/attachments/1143237412663869570/1...
[2]: https://cdn.discordapp.com/attachments/1143237412663869570/1...
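To put the quoted numbers in performance-per-watt terms, here is a small worked calculation. The second user does not state their starting power draw, so the 400 W baseline below is an assumption borrowed from the first quote; only the 200 W drop and the 9.2% performance loss come from the thread.

```python
def perf_per_watt_gain(baseline_watts, watts_saved, perf_loss_fraction):
    """Ratio of new to old performance-per-watt after an undervolt."""
    new_watts = baseline_watts - watts_saved
    if new_watts <= 0:
        raise ValueError("watts_saved must be smaller than baseline_watts")
    relative_perf = 1.0 - perf_loss_fraction
    relative_power = new_watts / baseline_watts
    return relative_perf / relative_power


# Assumption: 400 W baseline. From the quote: 200 W drop, 9.2% performance loss.
gain = perf_per_watt_gain(400, 200, 0.092)
print(f"perf per watt improves by a factor of about {gain:.2f}")
```

With a different baseline the factor changes, so treat the result as illustrative only.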

Thank you so much.

Are there any good resources related to expanding context windows, or even just the mechanics of how they actually work as properties of a model?


ttt3ts 1006 days ago

Lots. LLAMA 2 was trained on 4K context windows but can run on arbitrary lengths; the results just become garbage as you go longer.

I refer you to https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-c... for an "easy" to digest summary
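As one concrete illustration of the mechanics, here is a sketch of linear position interpolation for rotary position embeddings (RoPE), which Llama 2 uses. This technique is not named in the thread and may not be what the linked article or the poster relies on; it is shown only as a commonly used way to keep positions beyond 4K inside the range the model saw in training. The scale factor of 3 (12K / 4K) is an assumption chosen to match the numbers above.

```python
def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotation angles for one token position, one angle per pair of dimensions.

    With scale > 1, positions are compressed before the angles are computed,
    so a longer sequence maps back onto the position range seen in training.
    """
    pos = position / scale
    return [pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


trained_context = 4096
target_context = 12 * 1024
scale = target_context / trained_context  # 3.0

# The last position of the 12K window now lands inside the trained 0..4095 range.
last = rope_angles(target_context - 1, scale=scale)
print(last[0] < trained_context)
```

The trade-off is that neighbouring tokens end up closer together in position space, which is why approaches like this are usually paired with some fine-tuning.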


Reviving1514 1006 days ago

Edit: Nevermind, saw you posted elsewhere. Thank you!

Can you share your system specs? I was looking into something similar but my costs were closer to 6 to 8k for the whole system.


0x008 1005 days ago

Is the $2K you mentioned the total cost of ownership?
