HN Mirror



	by loudmax 1006 days ago
	You can run the smaller Llama variants on consumer-grade hardware, but people typically rent cloud GPUs to run the larger ones. It is possible to run the larger variants on a beefy workstation or gaming rig, but performance on consumer hardware usually makes this impractical. So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.


ramesh31 1006 days ago

>So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

Yes, and it doesn't even come close. Llama2-70b can run inference at 300+ tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.
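For anyone who wants to turn those figures into a per-token price, here is a rough sketch of the arithmetic. It takes the ~$0.50/hr and 300 tokens/s numbers above at face value and assumes the instance is kept busy for the full hour, which a sporadic workload will not manage. The function name is only illustrative.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens on a rented instance.

    Assumes 100% utilization: every second of the billed hour produces tokens.
    """
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Figures quoted in the comment above: ~$0.50/hr and 300 tokens/s.
print(round(cost_per_million_tokens(0.50, 300), 3))  # 0.463
```

At lower utilization the effective price rises proportionally: an instance that is busy 10% of the time costs ten times as much per token.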

cheptsov 998 days ago

How do you fit Llama2-70b into a V100? A V100 has 16GB. Llama2-70b at 4-bit would require up to 40GB. Also, what do you use for inference to get 300+ tokens/s?
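A back-of-the-envelope check of that memory estimate, as a minimal sketch: it counts the weights only (70 billion parameters at a given bit width) and ignores the KV cache, activations, and runtime overhead, which is presumably where the gap between 35GB and "up to 40GB" comes from. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Llama2-70b at common precisions; weights only, no KV cache or overhead.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

Even at 4-bit the weights alone are more than double a 16GB card, and they also exceed the 32GB V100 variant.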

thewataccount 1006 days ago

What's the best way to use Llama2-70b without existing infrastructure for orchestrating it?

mjirv 1006 days ago

I stumbled upon OpenRouter[0] a few days ago. Easiest I’ve seen by far (if you want SaaS, not hosting it yourself).

[0] https://openrouter.ai

ramesh31 1006 days ago

>What's the best way to use Llama2-70b without existing infrastructure for orchestrating it?

That's an exercise left to the reader for now, and is where your value/moat lies.

thewataccount 1006 days ago

> That's an exercise left to the reader for now, and is where your value/moat lies.

Hopefully more on-demand services enter the space. Where I am, we currently don't have the resources for any kind of self-orchestration, and our usage is so low and sporadic that we can't justify a dedicated instance.

Last I saw, the current services were rather expensive, but I should recheck.

pdntspa 1006 days ago

I bought an old server off ServerMonkey for like $700 with a stupid amount of RAM and CPUs, and it runs Llama2-70b fine, if a little slowly. Good for experimenting.
