by hehdhdjehehegwv 742 days ago
I dropped $5k on an A6000 and I can run llama3:70b day and night for the price of my electricity bill.

I’ve gone through hundreds of millions, maybe billions, of tokens in the past year.

This article is just “cloud is expensive” 101. Nothing new.

5 comments

1B tokens for Gemini Flash (which is on par with llama3-70b in my experience, or even better sometimes) at a 2:1 input-to-output ratio would cost ~600 bucks (ignoring the fact they offer 1M tokens a day for free now). Ignoring electricity, at roughly 1B tokens a year your $5k card would take >8 years to break even. You can find llama3-70b hosted for about the same price if you're interested in that specific model.
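For anyone who wants to plug in their own numbers, here is a rough sketch of that break-even arithmetic. The $5k and ~$600 per 1B tokens figures come from this thread; the 1B tokens per year usage rate is an assumption based on the parent comment, and the function name is made up for illustration. Electricity is ignored, as above.

```python
def break_even_years(hardware_cost_usd, api_cost_per_billion_usd, billions_of_tokens_per_year):
    """Years until a one-off hardware purchase equals cumulative API spend.

    Ignores electricity, resale value, and any API price changes.
    """
    yearly_api_spend = api_cost_per_billion_usd * billions_of_tokens_per_year
    return hardware_cost_usd / yearly_api_spend


# Figures from the thread: $5k GPU, ~$600 per 1B tokens.
# Assumed usage: about 1B tokens per year.
years = break_even_years(5000, 600, 1.0)
print(f"break-even after about {years:.1f} years")
```

Heavier usage shortens the payback in direct proportion, so the result is only as good as the tokens-per-year guess.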
I answered the financial thinking in another reply, but another factor is I need to know if the model today is exactly the same as tomorrow for reliable scientific benchmarking.

I need to tell if a change I made was impactful, but if the model just magically gets smarter or dumber at my tasks with no warning then I can’t tell if I made an improvement or a regression.

Whereas the model on my GPU doesn’t change unless I change it. So it’s one less variable, and LLMs are black boxes to start with.

I may be wrong for Gemini, but my impression is all the companies are constantly tweaking the big models. I know GPT on Monday is not always the same GPT on Thursday for example.
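One way to make "the model doesn't change unless I change it" checkable is to record a checksum of the local weights file next to each benchmark run. This is only a sketch of that idea, not something the poster described doing; the function name is hypothetical, and which file to hash depends on how your runtime stores its weights.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    multi-gigabyte weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Store the digest alongside each run's results; if two runs carry the same digest (and the same sampling settings), a score difference is not explained by different weights. A hosted API gives you no equivalent check.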

I've worked professionally over the last 12 months hosting quite a few foundation models and fine-tuned LLMs on our own hardware, AWS + Azure VMs, and also a variety of newer "inference serving" type services that are popping up everywhere.

I don't do any work with the output, I'm just the MLOps guy (ahem, DevOps).

You mention expense but on a purely financial basis I find any of these hosted solutions really hard to justify against GPT 3.5 turbo prices, including building your own rig. $5k + electricity is loads of 3.5 Turbo tokens.

Of course none of the data scientists or researchers I work with want to use that though - it's not their job to host these things or worry about the costs.

So my main motivation is not so much to have the lowest cost, but to have the most predictable cost.

Knowing up front this is my fixed ML budget gives me peace of mind and gives me room to try stupid ideas without worrying about it.

Whereas doing it in the cloud you can a) get slammed with some crazy bill by accident, b) have to think more about what resources testing an idea will take, or conversely c) get GPU FOMO and think “if I just upgrade a level all my problems will be solved”.

It works for me. Everybody's mileage varies, but personally I like to budget, spend, and then totally focus on my goals and not my cloud spend.

I’m also from the pre-cloud era, so doing stuff on my own bare metal is second nature.

Super cool, thanks for sharing. Do you mind sharing what you used the hundreds of millions (or billions) of tokens on?
Doing really nuanced classification of documents at very large scale. Needle in the haystack type problems.
Is this at 4-bit quantization? And how many tokens per second is the output?
I’m doing non-interactive tasks, but in terms of the A6000 running llama3 70b in chat mode it’s as usable as any of the commercial offerings in terms of speed. I read quickly and it’s faster than I read.
How's your ROI?
Absolutely phenomenal.
Are you using it for trading?
Nope, powers some low-level infrastructure-ish stuff.