| HN Mirror


by johnc1 1243 days ago

> Can these models feasibly be run locally?

Actually you can, it even works without GPU, here's a guide on running BLOOM (the open-source GPT-3 competitor of similar size) locally: https://towardsdatascience.com/run-bloom-the-largest-open-ac...

The problem is performance:

- if you have GPUs with > 330GB VRAM, it'll run fast
- otherwise, you'll run from RAM or NVMe, but very slowly, generating one token every few minutes or so (depending on RAM size / NVMe speed)

The future might be brighter: fp8 already exists and halves the RAM requirements (although it's still very hard to get it running), and there is ongoing research on fp4. Even that would still require 84GB of VRAM to run...
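As a rough check on those figures, here is a minimal sketch of the weight-memory arithmetic. It assumes BLOOM's 176 billion parameters (a number not stated in this thread) and counts only the weights, ignoring activations and any runtime overhead; `weight_memory_gib` is a hypothetical helper name.

```python
# Assumption: BLOOM has about 176 billion parameters (not stated in the thread).
BLOOM_PARAMS = 176e9


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for name, bits in [("fp16", 16), ("fp8", 8), ("fp4", 4)]:
        print(f"{name}: {weight_memory_gib(BLOOM_PARAMS, bits):.0f} GiB")
```

Under these assumptions fp16 lands near the 330GB quoted above, and each halving of precision halves the footprint, which puts fp4 in the low 80s.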

5 comments

Towaway69 1243 days ago

From guide linked above:

> It is remarkable that such large multi-lingual model is openly available for everybody.

Am I the only one thinking that this remark is an insight into societal failure? The model has been trained on global freely available content; anyone who has published on the Web has contributed.

Yet the wisdom gained from our collective knowledge is assumed to be withheld from us. As the original remark was one of surprise, the authors' (and our) assumption is that trained models are expected to be kept from us.

ornornor 1243 days ago

I think it’s similar to how search engines keep their ranking formulas secret, and you can’t run your own off a copy of their index.

Yet we also all contributed to it by publishing (and feeding it, for instance by following Google's requirements for microdata). But we don't own any of it.

capableweb 1243 days ago

The main difference with a search engine is that a search engine ultimately links back to you. So a user who is interested in more, or wants to know where something comes from, ends up on your website.

The same is not true for these AI tools. The output could have been contributed by you, someone else, or everyone, or a combination of those, but it'll never be clear who actually contributed and there will be no credit to anyone besides the author(s) of the models.

ornornor 1243 days ago

Didn’t think of it this way, that makes sense. Thank you

lacasito25 1243 days ago

How much money do you think GPT-3 training cost?

Towaway69 1243 days ago

How much money do we spend contributing to the training set?

Those insights, comments, articles, code examples, etc. are free to use because we published them on sites that don't own the content but earn from it. If they owned it, then they would be responsible for hate speech.

So our costs for producing the training set are negligible.

PartiallyTyped 1243 days ago

I recommend reading the first few chapters of "The conquest of bread".

Dylan16807 1243 days ago

If it fits in system memory, is it still faster on GPU than CPU? Does that involve swapping out one layer at a time? Otherwise I'm very curious how it handles the PCIe latency.

Enough system memory to fit 84GB isn't all that expensive...

tempay 1243 days ago

Yes, the connection between system memory and the GPU isn’t fast enough to keep the compute units fed with data to process. Generally PCIe latency isn’t as much of a problem as bandwidth.
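To put a number on the bandwidth point, here is a back-of-the-envelope sketch. It assumes every weight has to cross the PCIe link once per generated token, and it assumes a link speed of roughly 32 GB/s (about the theoretical peak of PCIe 4.0 x16); both are simplifications, and `seconds_per_token` is a hypothetical helper.

```python
# Assumption: roughly the theoretical peak of a PCIe 4.0 x16 link.
PCIE4_X16_GB_PER_S = 32.0


def seconds_per_token(weights_gb: float, link_gb_per_s: float) -> float:
    """Lower bound on time per token if all weights are streamed once per token."""
    return weights_gb / link_gb_per_s


if __name__ == "__main__":
    # 84 GB is the fp4 figure quoted earlier in the thread.
    print(f"{seconds_per_token(84.0, PCIE4_X16_GB_PER_S):.2f} s per token")
```

Even in this best case the transfer alone costs a few seconds per token, which is why the bus rather than the GPU's compute ends up as the limit.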

adam_arthur 1243 days ago

Pretty cool!

Honestly even if it were to take a few minutes per response, that's likely sufficient for many use cases. I'd get value out of that if it allowed bypassing a paywall. I'm curious how these models end up being monetized/supported financially, as they sound expensive to run at scale.

The required disk space seems the biggest barrier for local.

afro88 1243 days ago

If it's a few minutes per token you might be waiting a lot longer for a full response: https://blog.quickchat.ai/post/tokens-entropy-question/
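A quick sketch of that arithmetic, with made-up inputs: the 100-token response length and the 3 minutes per token are illustrative assumptions, not measurements.

```python
def response_time_hours(tokens: int, minutes_per_token: float) -> float:
    """Total generation time in hours when tokens are produced one at a time."""
    return tokens * minutes_per_token / 60.0


if __name__ == "__main__":
    # Hypothetical: a 100-token answer at 3 minutes per token.
    print(response_time_hours(100, 3.0), "hours")
```

With those inputs a single short answer takes about five hours.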

I also wonder how OpenAI etc. provide access to these for free. Reminds me of the adage from when Facebook rose to popularity: "if something is free, 'you' are the product". Perhaps it's to gather lots more conversational training data for fine-tuning.

JackFr 1243 days ago

It would be remarkable and surprising if they weren’t doing that.

int_19h 1241 days ago

It's in their FAQ:

>> Who can view my conversations?

> As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements.

>> Will you use my conversations for training?

> Yes. Your conversations may be reviewed by our AI trainers to improve our systems.

JellyBeanThief 1243 days ago

Crowd-funded AI training coming soon to Patreon?

justplay 1243 days ago

do it now

logicallee 1243 days ago

> if you have GPUs with > 330GB VRAM, it'll run fast

What kind of GPUs with that much VRAM are available to consumers, and roughly how much would such a kit cost?

spyder 1243 days ago

He means multiple GPUs in parallel that have a combined VRAM of that size. So around 4 x NVIDIA A100 80GB, which you can get for around $8.4 / hour in the cloud, or 7 x NVIDIA A6000 or A40 48GB for $5.5 / hour.

So it's not exactly cheap or easy yet for the everyday user, but I believe the models will become smaller and more affordable to run. These are just the "first" big research models, focused on demonstrating some usefulness; after that, the focus can shift more to size and speed optimizations. There are multiple methods and a lot of research into making them smaller: distilling them, converting to lower precision, pruning the less useful weights, sparsifying. Some achieve around 40% size reduction and 60% speed improvement with minimal accuracy loss; others achieve 90% sparsity. So there is hope of running them or similar models on a single but powerful computer.

uni_rule 1243 days ago

You'd basically need a rack-mount server full of Nvidia H100 cards (80GB VRAM; they cost about $40,000 each). So... good luck with that? On the relatively cheap end, used Nvidia Tesla cards are kinda cheap, with 24GB ones from architectures a few years old going for ~$200. That's still nearly $3000 worth of cards, not counting the rest of the computer. This isn't really something you can run at home without having a whole "operation" going on.
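The card counts in this subthread follow from dividing the 330GB target by the VRAM per card. A minimal sketch, using only the figures quoted above (`cards_needed` is a hypothetical helper, and the ~$200 price is the commenter's estimate):

```python
import math


def cards_needed(total_vram_gb: float, vram_per_card_gb: float) -> int:
    """Smallest number of identical cards whose combined VRAM reaches the target."""
    return math.ceil(total_vram_gb / vram_per_card_gb)


if __name__ == "__main__":
    target_gb = 330  # VRAM figure from the top comment
    for card_gb in (80, 48, 24):
        print(f"{card_gb}GB cards: {cards_needed(target_gb, card_gb)}")
    # Used 24GB cards at the ~$200 each quoted above
    print("cost of 24GB cards: $", cards_needed(target_gb, 24) * 200)
```

A strict ceiling gives five 80GB cards rather than the "around 4" mentioned earlier, since 4 x 80GB is 320GB; the 7 x 48GB and "nearly $3000" figures match.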

logicallee 1243 days ago

got it, thanks.

flockonus 1243 days ago

fp4 ?= floating point with 4 bits??? I was already mind-blown by 8-bit floats; how can you fit any float precision in 4 bits?

Dylan16807 1243 days ago

For weights, the order of magnitude is the important part. And the sign bit. So you can get pretty good coverage with only 16 values.
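To make the "16 values" concrete, here is a sketch of one possible 4-bit layout: 1 sign bit, 3 exponent bits, and no mantissa, so every code is a signed power of two. The layout, the bias of 3, and the absence of special codes for zero or infinity are all assumptions for illustration; real fp4 proposals differ.

```python
def fp4_values(bias: int = 3) -> list[float]:
    """Enumerate all values of a hypothetical 1-sign, 3-exponent, 0-mantissa format."""
    values = []
    for code in range(16):            # every 4-bit pattern
        sign = -1.0 if code & 0b1000 else 1.0
        exponent = code & 0b0111      # low three bits
        values.append(sign * 2.0 ** (exponent - bias))
    return sorted(values)


if __name__ == "__main__":
    print(fp4_values())
```

With this layout the representable magnitudes run from 1/8 to 16 in steps of a factor of two, which is the order-of-magnitude coverage described above.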

JellyBeanThief 1243 days ago

Down that far, I start to wonder if trinary circuits might become useful again.

fp4 with 1-3-0 would mean 27 values if the first bit were interpreted as binary. But (and an engineer should check me on this, because to me a transistor is a distant abstraction) I think you could double that to 54 values if you were clever with the sign bit and arithmetic circuitry. Maybe push it to 42 if only some of my intuition is wrong.

blagie 1243 days ago

You're wrong on many levels.

The basic reason for binary is that it's generally faster, especially as you scale to smaller transistors with more noise.

int_19h 1241 days ago

Here's how Brusentsov (who designed https://en.wikipedia.org/wiki/Setun) described the rationale for his choice of ternary:

"At that time [1955], transistors were not yet available, but it was clear that the machine should not use vacuum tubes. Tubes have a short lifespan, and tube-based machines were idle most of the time because they were always being repaired. A tube machine worked at best for several hours, then it was necessary to look for another malfunction. Yuli Izrailevich Gutenmakher built the LEM-1 machine on ferrite-diode elements. The thought occurred to me that since there are no transistors, then you can try to make a computer on these elements. Sobolev, whom everyone respected very much, arranged for me to go on an internship with Gutenmakher. I studied everything in detail. Since I am a radio engineer by education, I immediately saw that not everything should be done the way they did it. The first thing I noticed is that they use a pair of cores for each bit, one working and one compensating. And an idea came to my mind: what if we make the compensation core do work, as well? Then each cell becomes three-state. Consequently, the number of cores in Setun was seven times less than in LEM-1."

(https://notesofprogrammer.blogspot.com/2010/03/blog-post.htm...)

Dylan16807 1243 days ago

But why? There's nothing special about having 4 storage elements. If you want 54 values then 6 bits are going to be just as effective as 4 trits, and easier to implement in every way.
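The comparison can be checked by counting how many digits of each base are needed to cover a given number of values. A small sketch (`digits_needed` is a hypothetical helper; it uses integer arithmetic to avoid floating-point logarithms):

```python
def digits_needed(values: int, base: int) -> int:
    """Smallest number of base-`base` digits that can represent `values` distinct states."""
    digits, capacity = 0, 1
    while capacity < values:
        capacity *= base
        digits += 1
    return digits


if __name__ == "__main__":
    print("bits for 54 values: ", digits_needed(54, 2))   # 2**6 = 64 states
    print("trits for 54 values:", digits_needed(54, 3))   # 3**4 = 81 states
```

So 54 values fit in either 6 bits or 4 trits, which is the equivalence the comment relies on.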