by adam_arthur 1247 days ago

So what's the level of effort to create ChatGPT equivalent products?

Is it something where we'll have 100s of competing AIs, or is it gated to only a few large companies? Not up to date on current training/querying costs.

Can these models feasibly be run locally?

Given the large number of ChatGPT competitors already announced, I fail to see how the space will be easily defensible or monetizable (despite the large value add, competitors can easily undercut each other).

3 comments

johnc1 1247 days ago

> Can these models feasibly be run locally?

Actually you can, it even works without a GPU. Here's a guide on running BLOOM (the open-source GPT-3 competitor of similar size) locally: https://towardsdatascience.com/run-bloom-the-largest-open-ac...

The problem is performance:

- if you have GPUs with > 330GB VRAM, it'll run fast
- otherwise, you'll run from RAM or NVMe, but very slowly, generating one token every few minutes or so (depending on RAM size / NVMe speed)

The future might be brighter: fp8 already exists and halves the RAM requirements (although it's still very hard to get it running), and there is ongoing research on fp4. Even that would still require 84GB of VRAM to run...
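As a rough check on those figures, here is a minimal sketch of the weight-memory arithmetic. It assumes BLOOM's published size of 176B parameters (not stated in this thread) and counts the weights only, ignoring activations and runtime overhead, so it is a ballpark and not a measurement.

```python
# Rough weight-memory estimate; ignores activations, caches and runtime overhead.
PARAMS = 176e9  # assumption: BLOOM's published parameter count (176B)

BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}

def weight_memory_gib(params: float, bits: int) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return params * bits / 8 / 2**30

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gib(PARAMS, bits):.0f} GiB")
```

Under these assumptions the fp16 and fp4 results land in the same ballpark as the 330GB and 84GB numbers quoted above, and each halving of the bit width halves the footprint.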

Towaway69 1246 days ago

From guide linked above:

> It is remarkable that such large multi-lingual model is openly available for everybody.

Am I the only one thinking that this remark is an insight into societal failure? The model has been trained on global freely available content; anyone who has published on the Web has contributed.

Yet the wisdom gained from our collective knowledge is assumed to be withheld from us. As the original remark was one of surprise, the authors' (and our) assumption is that trained models are expected to be kept from us.

ornornor 1246 days ago

I think it’s similar to how search engines keep their ranking formulas secret, and you can’t run your own off a copy of their index.

Yet we also all contributed to it by publishing (and feeding it, for instance by following Google's requirements for microdata). But we don’t own any of it.

capableweb 1246 days ago

The main difference with a search engine is that a search engine ultimately links back to you. So a user who is interested in more, or wants to know where it comes from, ends up on your website.

The same is not true for these AI tools. The output could have been contributed by you, someone else, or everyone, or a combination of those, but it'll never be clear who actually contributed and there will be no credit to anyone besides the author(s) of the models.

ornornor 1246 days ago

Didn’t think of it this way, that makes sense. Thank you

lacasito25 1246 days ago

How much money do you think GPT-3 training cost?

Towaway69 1246 days ago

How much money do we spend contributing to the training set?

Those insights, comments, articles, code examples, etc. are free to use because we published those on sites that don't own the content but earn from it. If they owned them, then they would be responsible for hate speech.

So our cost for producing the training set is negligible.

PartiallyTyped 1246 days ago

I recommend reading the first few chapters of "The Conquest of Bread".

Dylan16807 1247 days ago

If it fits in system memory, is it still faster on GPU than CPU? Does that involve swapping out one layer at a time? Otherwise I'm very curious how it handles the PCIe latency.

Enough system memory to fit 84GB isn't all that expensive...

tempay 1247 days ago

Yes, the connection between system memory and the GPU isn’t fast enough to keep the compute units fed with data to process. Generally PCIe latency isn’t as much of a problem as bandwidth.
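The bandwidth point can be made concrete with a back-of-the-envelope bound. This sketch assumes a dense model where every weight has to cross the link once per generated token, and it assumes roughly 32 GB/s as the theoretical peak of a PCIe 4.0 x16 link; both are illustrative assumptions, not figures from this thread.

```python
def seconds_per_token(model_bytes: float, link_bytes_per_s: float) -> float:
    """Lower bound on time per token if all weights cross the link once per token."""
    return model_bytes / link_bytes_per_s

MODEL_BYTES = 84e9   # the fp4 size quoted earlier in the thread
PCIE4_X16 = 32e9     # assumption: ~32 GB/s theoretical peak for PCIe 4.0 x16

print(f"{seconds_per_token(MODEL_BYTES, PCIE4_X16):.2f} s per token (best case)")
```

Even in this best case the GPU spends seconds per token waiting on transfers, which is why the link bandwidth and not the latency dominates.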

adam_arthur 1247 days ago

Pretty cool!

Honestly even if it were to take a few minutes per response, that's likely sufficient for many use cases. I'd get value out of that if it allowed bypassing a paywall. I'm curious how these models end up being monetized/supported financially, as they sound expensive to run at scale.

The required disk space seems the biggest barrier for local.

afro88 1246 days ago

If it's a few minutes per token you might be waiting a lot longer for a full response: https://blog.quickchat.ai/post/tokens-entropy-question/
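To put a number on that, a minimal sketch of the arithmetic. The reply length (100 tokens) and the speed (2 minutes per token) are assumed values chosen only for illustration.

```python
def response_minutes(tokens: int, minutes_per_token: float) -> float:
    """Total generation time when tokens are produced one after another."""
    return tokens * minutes_per_token

# Assumptions: a 100-token reply at 2 minutes per token.
total = response_minutes(100, 2.0)
print(f"{total:.0f} minutes, i.e. about {total / 60:.1f} hours")
```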

I also wonder how OpenAI etc. provide access to these for free. Reminds me of the adage from when Facebook rose to popularity: "if something is free, 'you' are the product". Perhaps to gather lots more conversational training data for fine tuning.

JackFr 1246 days ago

It would be remarkable and surprising if they weren’t doing that.

int_19h 1244 days ago

It's in their FAQ:

>> Who can view my conversations?

> As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements.

>> Will you use my conversations for training?

> Yes. Your conversations may be reviewed by our AI trainers to improve our systems.

JellyBeanThief 1247 days ago

Crowd-funded AI training coming soon to Patreon?

justplay 1246 days ago

do it now

logicallee 1246 days ago

> if you have GPUs with > 330GB VRAM, it'll run fast

What kind of GPUs with that much VRAM are available to consumers, and roughly how much would such a kit cost?

spyder 1246 days ago

He means multiple GPUs in parallel that have a combined VRAM of that size. So around 4 x NVIDIA A100 80GB, which you can get for around $8.4 / hour in the cloud, or 7 x NVIDIA A6000 or A40 48GB for $5.5 / hour.
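A quick sketch of that arithmetic, using only the card counts, sizes and hourly prices quoted above. The 24-hour figure is just the hourly price multiplied out and assumes the price stays flat.

```python
# (number of cards, GB per card, USD per hour) as quoted in the comment above
SETUPS = {
    "4 x A100 80GB": (4, 80, 8.4),
    "7 x A6000/A40 48GB": (7, 48, 5.5),
}

def combined_vram_gb(count: int, gb_per_card: int) -> int:
    """Total VRAM when the model is split across all cards."""
    return count * gb_per_card

def cost_per_day(usd_per_hour: float) -> float:
    """Cost of renting the setup for 24 hours at a flat hourly price."""
    return usd_per_hour * 24

for name, (count, gb, price) in SETUPS.items():
    print(f"{name}: {combined_vram_gb(count, gb)} GB, ${cost_per_day(price):.2f}/day")
```

Note that 4 x 80GB gives 320GB, slightly under the 330GB figure, which fits the "around" in the comment.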

So not exactly cheap or easy yet for the everyday user, but I believe the models will become smaller and more affordable to run. These are just the "first" big research models, focused on demonstrating some usefulness; after that, the focus can shift to size and speed optimizations. There are multiple methods and a lot of research into making them smaller: distilling them, converting to lower precision, pruning the less useful weights, sparsifying. Some achieve around 40% size reduction and 60% speed improvement with minimal accuracy loss, others achieve 90% sparsity. So there is hope to run them or similar models on a single but powerful computer.
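As an illustration of what "pruning the less useful weights" can mean, here is a minimal sketch of magnitude pruning on a plain list of numbers. It is a toy version under the assumption that "less useful" means "smallest absolute value"; real methods work on tensors and usually fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)
    # Indices of the k smallest-magnitude weights.
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.5, -0.01, 0.2, 0.03, -0.9, 0.002, 0.07, -0.4, 0.001, 0.15]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
```

The zeroed entries can then be stored in a sparse format, which is where the size reduction comes from.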

uni_rule 1246 days ago

You'd basically need a rack mount server full of Nvidia H100 cards (80GB VRAM, they cost 40 thousand US dollars each). So... good luck with that? On the relatively cheap end, Nvidia Tesla cards are kinda cheap used, 24 gig ones going for ~$200 with architectures from a few years ago. That's still nearly $3000 worth of cards not counting the rest of the whole computer. This isn't really something you can run at home without having a whole "operation" going on.

logicallee 1246 days ago

got it, thanks.

flockonus 1247 days ago

fp4 ?= floating point with 4 bits??? I was already mind blown by floats of 8b, how can you fit any float precision in 4b?

Dylan16807 1247 days ago

For weights, the order of magnitude is the important part. And the sign bit. So you can get pretty good coverage with only 16 values.
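A minimal sketch of what such a format could look like: one sign bit and three exponent bits, no mantissa. The bit layout and the bias of 4 are hypothetical choices for illustration and do not describe any specific hardware format.

```python
def decode_fp4(code: int, bias: int = 4) -> float:
    """Decode a hypothetical 4-bit float: 1 sign bit, 3 exponent bits, no mantissa."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = code & 0b0111
    return sign * 2.0 ** (exponent - bias)

# All 16 codes map to distinct signed powers of two.
values = sorted(decode_fp4(code) for code in range(16))
print(values)
```

With this layout the 16 codes cover magnitudes from 1/16 to 8 in both signs, so the order of magnitude and the sign survive even though nothing finer does.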

JellyBeanThief 1247 days ago

Down that far, I start to wonder if trinary circuits might become useful again.

fp4 with 1-3-0 would mean 27 values if the first bit were interpreted as binary. But--and an engineer should check me on this cause to me a transistor is a distant abstraction--I think you could double that to 54 values if you were clever with the sign bit and arithmetic circuitry. Maybe push it to 42 if only some of my intuition is wrong.

blagie 1246 days ago

You're wrong on many levels.

The basic reason for binary is that it's generally faster, especially as you scale to smaller transistors with more noise.

int_19h 1244 days ago

Here's how Brusentsov (who designed https://en.wikipedia.org/wiki/Setun) described the rationale for his choice of ternary:

"At that time [1955], transistors were not yet available, but it was clear that the machine should not use vacuum tubes. Tubes have a short lifespan, and tube-based machines were idle most of the time because they were always being repaired. A tube machine worked at best for several hours, then it was necessary to look for another malfunction. Yuli Izrailevich Gutenmakher built the LEM-1 machine on ferrite-diode elements. The thought occurred to me that since there are no transistors, then you can try to make a computer on these elements. Sobolev, whom everyone respected very much, arranged for me to go on an internship with Gutenmacher. I studied everything in detail. Since I am a radio engineer by education, I immediately saw that not everything should be done the way they did it. The first thing I noticed is that they use a pair of cores for each bit, one working and one compensating. And an idea came to my mind: what if we make the compensation core do work, as well? Then each cell becomes three-state. Consequently, the number of cores in Setun was seven times less than in LEM-1."

(https://notesofprogrammer.blogspot.com/2010/03/blog-post.htm...)

Dylan16807 1246 days ago

But why? There's nothing special about having 4 storage elements. If you want 54 values then 6 bits are going to be just as effective as 4 trits, and easier to implement in every way.
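The counting behind this can be checked directly. The sketch below just counts how many digits of a given base are needed to represent a number of distinct values; the helper name is made up for illustration.

```python
def digits_needed(values: int, base: int) -> int:
    """Smallest number of base-`base` digits that can represent `values` distinct states."""
    digits, capacity = 0, 1
    while capacity < values:
        capacity *= base
        digits += 1
    return digits

print(2 * 3**3)              # one binary sign bit plus three trits
print(digits_needed(54, 2))  # bits needed for 54 values
print(digits_needed(54, 3))  # trits needed for 54 values
```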

wokwokwok 1247 days ago

> Can these models feasibly be run locally?

Bluntly, no.

The models which are small enough to run locally perform so badly it’s not worth bothering.

To run inference on the large models that perform decently you need the equivalent of two or three top end graphics cards.

If you're serious about looking into it now, consider looking at this project that lets you run a bunch of independent machines as a cluster for inference using Bloom:

https://github.com/bigscience-workshop/petals/wiki/Launch-yo...

(You'll need around 200GB of GPU memory across the machines in the swarm)

corobo 1247 days ago

How badly is bad? What sort of output are we talking?

I am asking as I once had a Markov-chain IRC bot* and while it often struggled to string together a sentence, it was quite hilarious sometimes. Absolutely pointless other than the occasional laugh.

Can it form sentences or are those small models completely unusable for anything?

I'm not thinking OpenAI level uses - sort of compare a Postgres cluster to a SQLite file (not literally, conceptually I guess). Can it be used for single tasks in any way?

Could it figure out how to map search terms to URLs for a knowledge base type thing?

Forgive me if these are silly questions. The extent of my knowledge in this field is asking ChatGPT questions and going "that's so cool" when it answers.

* Your phone's predictive text except it finishes the sentence itself based on a word someone in chat used so that it felt on topic.

In my case it also learned how to form sentences from other people talking in chat, in hindsight it's amazing I never had a Tay issue.

https://en.m.wikipedia.org/wiki/Tay_(bot)
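For readers unfamiliar with the idea, a minimal sketch of a first-order Markov-chain text generator of the kind described in the footnote. This is a generic illustration, not the actual bot; the function names and sample chat lines are made up.

```python
import random
from collections import defaultdict

def build_chain(lines):
    """Map each word to the list of words that followed it in the training lines."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, seed, max_words=12, rng=random):
    """Start from a seed word and keep picking a random known successor."""
    out = [seed]
    while len(out) < max_words and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

chat = ["the bot likes eggs", "eggs are great"]
chain = build_chain(chat)
print(generate(chain, "bot"))
```

Seeding with a word someone just used in chat is what makes the output feel on topic.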

ggerganov 1246 days ago

I was recently playing with the GPT-2 and GPT-J models. Results are often nonsensical for any practical purposes, but I think they can be used for making something fun - similar to your IRC bot idea.

If you are interested in running these models yourself without having a beefy GPU, you can try my custom inference implementation. It's in pure C/C++ without any 3rd party dependencies, runs straight on the CPU and builds very easily. I think it is relatively well optimised. For example, on a MacBook M1 Pro I can run GPT-2 XL (1.5B params) at 42ms/token and GPT-J / GPT-JT (6B params) at 125ms/token.
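Converted to throughput, those latencies look like this. The sketch only does the unit conversion on the two numbers quoted above.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token

for model, latency_ms in {"GPT-2 XL": 42, "GPT-J": 125}.items():
    print(f"{model}: {tokens_per_second(latency_ms):.1f} tokens/s")
```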

Here are a couple of generated examples using GPT-J:

https://github.com/ggerganov/ggml/tree/master/examples/gpt-j

These are examples using a zero-shot prompt where the model auto-completes a text given a starting prompt. You can try to make a conversation bot with a few-shot prompt, but it's not great. Probably the model needs some fine-tuning for that to become feasible.
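To make the distinction concrete, a minimal sketch of how a few-shot conversation prompt could be assembled as plain text before handing it to an auto-completing model. The "User:" / "Bot:" labels and the example turns are assumptions for illustration, not a format required by any of the models mentioned.

```python
def few_shot_prompt(examples, user_message):
    """Prepend example exchanges so the model can continue the pattern."""
    turns = [f"User: {question}\nBot: {answer}" for question, answer in examples]
    # Leave the final answer open for the model to complete.
    turns.append(f"User: {user_message}\nBot:")
    return "\n\n".join(turns)

examples = [
    ("hello", "Hi there!"),
    ("what do you like?", "Eggs, mostly."),
]
prompt = few_shot_prompt(examples, "how are you?")
print(prompt)
```

A zero-shot prompt would be the last turn on its own, with no example exchanges in front of it.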

corobo 1246 days ago

I'll have to have a look into those, there's an audience of about 12 people that would be thrilled to hear "eggbot" is back with a shiny modern brain on, haha.

Oddly enough any processing delay is good in an "AI" chat bot, within reason, makes it feel more natural rather than getting a response ping instantly. Chat version of uncanny valley or something, haha.

Something it also did in Markov form was pick randomly from the longest words in the sentence it had decided to reply to, build the rest of it from that, then run itself "backwards" from the picked word to a sentence starter word it knew.

Thank you for the reply! Looking forward to some tinkering.

wokwokwok 1247 days ago

> or are those small models completely unusable for anything?

Sadly, they really offer almost no value.

For the effort, you’re better off with an NLP framework like spaCy.

You can play with the small GPT-Neo models on Hugging Face, e.g. https://huggingface.co/EleutherAI/gpt-neo-125M

…but, the tldr is they’re cute to play with, but practically, the content they can generate is short, inconsistent and full of errors.

blagie 1246 days ago

... which is actually not "almost no value." The value of smaller models is different. For example, I have anonymized data, with fields removed. The smaller models do fine for filling those fields in with plausible values.

The smaller models do okay for zero-shot clustering of data in many cases (e.g. liberal versus conservative text), and if not, with minimal training. For generating statistics or probabilistic information about large amounts of text, they're great.

GPT-3, they're not, but I use them in my day-to-day work quite a bit more than I thought I would. I bought a GPU for one purpose, and I find I spin it up a lot these days.

I /really/ want to be able to use a large-scale language model locally, though. For the types of things I'd like it for, such as helping me draft emails, I don't trust OpenAI with my data.

visarga 1246 days ago

FLAN T5 shows promising signs, but it doesn't get even to 50% of GPT-3 performance.

anigbrowl 1247 days ago

How much of this is the language vs the vast amount of passably accurate domain knowledge? ChatGPT etc. seem magic because they can answer questions about virtually anything with a high degree of plausibility. It often gets specific facts wrong, but the general contours are correct. Many of us know a lot of trivia/specialist knowledge, but I don't think anyone is as broadly informed as ChatGPT appears to be. It's not clear where the language ends and the encyclopedic knowledge starts, but the latter must be taking up a very large amount of the space in the model.

visarga 1246 days ago

There have been attempts to separate fact knowledge from language knowledge - for example DeepMind RETRO that uses a search index of 1T tokens. RETRO manages to reach GPT-3 performance on some tasks with a 20x smaller model. I believe smaller models are more useful for extractive and classification tasks than creative text generation.

dragonwriter 1246 days ago

> How much of this is the language vs the vast amount of passably accurate domain knowledge?

LLMs don’t have domain knowledge, it's all language.

anigbrowl 1246 days ago

That's what I meant by 'It's not clear where the language ends and the encyclopedic knowledge starts,' since the model (and perhaps our brains) make little distinction.

But the model seems to be storing an absolutely vast amount of information, beyond the capability of any individual person to accumulate and recall. This is clearly not a prerequisite for language, even if the information is represented linguistically. Put another way, at age 20 I had read maybe 10-20% of what I've read since, but I was capable of reading comprehension and conversation even though my levels of knowledge and insight were much lower. By 'comprehension' I mean in the sense of being able to read a piece of text and answer questions about it or rewrite it, without necessarily having any priors about the topic; the kind of task we expect to be able to assign to a high school graduate.

I'm wondering what the size of an 'ignorant' language model is, as a precursor to more curated/directed training. While the state of the art is very impressive, it's a bit like taking a feast for a thousand people and rendering it into a giant cube of spam. This strategy seems guaranteed to produce a succession of increasingly capable idiots savant but limits other avenues of exploration.

simne 1245 days ago

> at age 20 I had read maybe 10-20% of what I've read since, but I was capable of reading comprehension and conversation...

This is because human intelligence is not just language, but a lot of indirect context, "software" inside the spinal cord (and other non-cortex parts of the brain), and even the human body itself.

But as far as I know, current LLMs work in plain flat structures. At the moment, nobody has tried to use even neocortex-like structures, let alone considered an artificial spinal cord.

All of this looks like trying to teach a table lamp, or something similarly smart.

adam_arthur 1247 days ago

Hmmm, 2-3 high end GPUs implies it's likely not very far off from mainstream. Maybe runnable on the average device within 10-20 years... perhaps even sooner if the model/software can be optimized?

dahdum 1247 days ago

> 2-3 high end GPUs implies it's likely not very far off from mainstream

Looks like FLOP/s per $ are doubling every ~3 years for high end cards, and 10x in ~10 years. So probably not that far off for desktop users.

https://www.lesswrong.com/posts/c6KFvQcZggQKZzxr9/trends-in-...
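The two figures are consistent with each other, as a quick compounding check shows. The sketch assumes a constant doubling time of exactly 3 years, which is a simplification of the "~3 years" quoted above.

```python
def growth_factor(years: float, doubling_years: float = 3.0) -> float:
    """Improvement factor after `years`, given a fixed doubling time."""
    return 2 ** (years / doubling_years)

for years in (3, 10, 20):
    print(f"{years} years: {growth_factor(years):.1f}x")
```

Doubling every 3 years compounds to roughly 10x in 10 years and roughly 100x in 20.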

visarga 1246 days ago

FLOPs remain unused when the bottleneck is bandwidth.

bemmu 1247 days ago

I’d be surprised if there weren’t any algo breakthroughs before that to make these several times faster (10x?).

(such as are all of the weights really needed all of the time, or could you load different ones based on recent context?)

If in 10 years on top of that you’d have that 10x faster hardware as well, you might be running GPT-3s as just a subcomponent of games/apps.

ETH_start 1247 days ago

ChatGPT prompt: What's the difference between a component and subcomponent?

A component is a self-contained unit of functionality, often in the form of a class or object, that represents a piece of an overall system or application. A subcomponent, on the other hand, is a smaller, simpler component that is used as part of a larger component. A subcomponent is typically used to break down a larger component into smaller, more manageable pieces, and is often used to create reusable building blocks for larger components. In other words, a subcomponent is a component that is used by another component.

lolspace 1247 days ago

20 years?

adam_arthur 1247 days ago

In 20 years I'll still be alive and enjoying myself. Does 20 years seem long to you?

krsdcbl 1247 days ago

it seems a bit long for achieving just running your own AI models locally, given it seems to be largely a question of VRAM and that you already _could_ do it today with a handful of graphics cards.

20y ago we had the GeForce4 Ti 4400; current graphics cards now come with 100x the VRAM and 50-60x the bandwidth.

5e92cb50239222b 1246 days ago

That's one hell of an assumption. Many of my Russian friends were absolutely certain they'll be alive and well for at least the next 20 years not that long ago.

simne 1245 days ago

Sure, they will be alive if they don't come to Ukraine.

adam_arthur 1246 days ago

If I'm dead then being able to run an AI locally doesn't matter anyway

lfkdev 1247 days ago

Two or three top GPUs? That's basically nothing for a professional project or even an invested hobby.

ck2 1247 days ago

Datasets.

The one with the largest, most personal, most obtrusive, invasive dataset will probably win.

The one that has absorbed every podcast, every YouTube video, every closed-caption text in existence, will have the most "complete" answers.

visarga 1246 days ago

Hidden datasets can be replaced with model predictions collected from a public API. So they can be "exfiltrated" from the trained model. And we already maxed out on the accessible online text and the good quality sources.

What is going to make a difference is running models to generate more text for training, because relying on humans alone doesn't scale. For example we could be using LLMs to do brute force problem solving and then fine-tuning on solutions.

AlphaZero is the shining example of a model trained on its own generated data and surpassing us at our own game. The self-generated data approach has the potential to reach superhuman levels of performance.

ck2 1246 days ago

How about illegal datasets like all the phone calls the NSA has been collecting domestically? Someone is going to train a private ChatGPT with that for queries.

simne 1245 days ago

Only legally gathered, absolutely "white" datasets could win, because gray/black methods of gathering lack feedback.

You have no way to verify whether gray/black sources really gathered the data or faked it.