Hacker News
by grumbelbart2 521 days ago
They could, would and should. But: Training a state of the art LLM costs millions in GPU time and electricity alone. There is no "open" organization at this point that can cover this. Current "open source public models" are shared by big players like Meta to undermine the competition. And they only publish their weights, not the training data, training protocols, or training code, meaning the result isn't reproducible, and it's questionable whether the training data is kosher.

I think it's important to remember that we know neural networks can be trained to a very useful state from scratch for 24 GJ: This is 25 W for 30 years (or 7000 kWh, or a good half ton of diesel fuel), which is what a human brain consumes until adulthood.
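For anyone who wants to check the arithmetic, here is a rough sketch of the figure above. The 365.25-day year and the ~45 MJ/kg heating value for diesel are my assumptions, not numbers from the comment; with them the result comes out at roughly 24 GJ, somewhat under 7000 kWh, and a bit over 500 kg of diesel.

```python
# Back-of-the-envelope check of the brain-energy figure.
# Assumptions (not from the thread): 365.25 days per year,
# diesel heating value of roughly 45 MJ/kg.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
DIESEL_MJ_PER_KG = 45.0


def brain_energy_joules(power_watts=25.0, years=30.0):
    """Energy consumed at constant power over the given number of years."""
    return power_watts * years * SECONDS_PER_YEAR


energy_j = brain_energy_joules()
energy_gj = energy_j / 1e9                         # joules -> gigajoules
energy_kwh = energy_j / 3.6e6                      # joules -> kilowatt-hours
diesel_kg = energy_j / (DIESEL_MJ_PER_KG * 1e6)    # joules -> kg of diesel

print(f"{energy_gj:.1f} GJ, {energy_kwh:.0f} kWh, {diesel_kg:.0f} kg diesel")
```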

Even though our artificial training efficiency is worse now, likely to stay worse because we want to trade efficiency for faster training, and because we want to cram more knowledge into our training data than a human would be exposed to, it still seems likely to me that we'll get within orders of magnitude of this sooner or later.

Even if our training efficiency topped out at a hundred times worse than a biological system, that would be the energy equivalent of <100 tons of diesel fuel. Compared to raising and educating a human (and also considering this training can then be utilized for billions of queries before it becomes obsolete) that strikes me as a very reasonable cost (especially compared to the amounts of energy we wasted on cryptocurrency mining without blinking an eye...)
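A quick sketch of that scaling, under stated assumptions: the 24 GJ baseline is the figure from this thread, the ~45 MJ/kg for diesel is an assumed approximate value, and the one billion queries used for amortization is a purely illustrative number.

```python
# Scale the brain-energy baseline by an inefficiency factor and
# amortize the result over a number of queries.
BRAIN_ENERGY_J = 24e9      # baseline figure from the thread
DIESEL_J_PER_KG = 45e6     # assumed approximate heating value of diesel


def training_energy_j(inefficiency_factor):
    """Training energy if we are `inefficiency_factor` times worse than a brain."""
    return BRAIN_ENERGY_J * inefficiency_factor


def diesel_tonnes(energy_j):
    """Mass of diesel (metric tonnes) with the same energy content."""
    return energy_j / DIESEL_J_PER_KG / 1000.0


def joules_per_query(energy_j, queries):
    """Training energy amortized over the number of queries served."""
    return energy_j / queries


energy = training_energy_j(100)            # "a hundred times worse"
tonnes = diesel_tonnes(energy)
per_query = joules_per_query(energy, 1e9)  # 1e9 queries is a hypothetical count

print(f"{tonnes:.0f} t diesel, {per_query:.0f} J of training energy per query")
```

With these inputs the total lands at a bit over 50 tonnes, which is consistent with the "<100 tons" estimate.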

This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years. We're not a tabula rasa training at birth to perfectly fit whatever set of training data we're presented. Far from it. Human learning is more akin to RAG, or test time training - specialising a heavily pre-trained model. It's not that we're born with very much knowledge, it's more that we're heavily specialised to acquire and retain certain kinds of knowledge and behaviour that are adaptive in the EEA (environment of evolutionary adaptedness). If the environment then doesn't provide the correct triggers at the correct times for activation of various learning mechanisms - best known being the critical period for language acquisition, we don't unfold into fully trained creatures. Bear in mind also that the social environment is vital both for human learning and functioning - we learn in the emotional, cognitive and resource provision context of other humans. And what we learn are behaviours that are effective in that context. Even in adulthood, the quickest way to make our cognitive architecture break down is to deny us social contact (hence the high rates of 'mental illness' in solitary confinement).
> This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years. We're not a tabula rasa training at birth to perfectly fit whatever set of training data we're presented. Far from it. Human learning is more akin to RAG

Yes, but.

The human genome isn't that big (3.1 gigabases), and most of that is shared with other species that aren't anything like as intelligent — it's full of stuff that keeps us physically alive, lets us digest milk as adults, darkens our skin when exposed to too much UV so we don't get cancer, gives us (usually) four limbs with (usually) five digits that have keratin plates on their tips, etc.

That pre-training likely gives us innate knowledge of smiles and laughter, of the value judgment that pain is bad and that friendship is good, and (I suspect from my armchair) enough* of a concept of gender that when we hit puberty we're not all bisexual by default.

Also, there's nothing stopping someone from donating their genome to be used as a pre-training system, if we could decode the genome well enough to map out pre-training like that.

* which may be some proxy for it, e.g. "arousal = ((smell exogenous sex hormone) and (exogenous hormone xor endogenous hormone))", which then gets used to train the rest of our brains for specific interests — evolution is full of hack jobs like that

You've missed the fact that sequencing our genome isn't gathering all the information required. To duplicate a human in computational space - say to create some accelerated AI simulation, you'd need to sequence a complete Telomere-to-Telomere genome (something achieved for the first time only last year!), complete Centromere sequencing (not yet achieved). You'd also need to 'sequence' or somehow encode the epigenome - DNA methylation, histone modifications, and other epigenetic markers. Then you'd need to do the same for both mitochondrial DNA and the human microbiome - every functional bacteria and virus we host (quite the task given how little we understand this ecosystem and its interactions with our own behaviour). Then you'd need to combine genome sequencing with transcriptomics (RNA sequencing), proteomics (proteins), and metabolomics to get a holistic view of human biology.

To make this data 'actionable' for a synthetic intelligence you'd need to functionally replicate the contributions of the intrauterine environment to development, and lastly simulate the social and physical environment. This can't be 'decoded' in the way you implicitly suggest, since its decompression is computationally irreducible. These are dynamic processes that need to be undergone in order to create the fully developed individual.

[1] https://www.bbc.com/future/article/20230210-the-man-whose-ge...

And most of that is then stuff you can throw away because it's not pre-training your brain; and the stuff that does, while we don't know the full mechanism, we know it works through the laws of physics.

It's like knowing the weights without knowing the full graph of the model they're used in, just the endpoints.

There's a lot of valid stuff in what you say, I am aware I'm glossing over a lot of challenges to get a copy of a human — to what extent is e.g. the microbiome even contributing to our intelligence, vs. being several hundred different parasites that share a lot of DNA with each other and which happen to accidentally also sometimes give us useful extras? It's hard work telling which is which — but my claim is that the nature and scope of such work itself still allows us to say, as per one of the parent comments:

> I think it's important to remember that we know neural networks can be trained to a very useful state from scratch for 24 GJ: This is 25 W for 30 years (or 7000 kWh, or a good half ton of diesel fuel), which is what a human brain consumes until adulthood.

If this were a 100m sprint, then I would agree with you essentially saying that we don't even know which country the starting blocks are in, but I am still saying that despite that we know the destination can be reached from the starting blocks in 10 seconds.

> This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years.

Yes. But that is not part of the training cost; this is basically the equivalent to figuring out a suitable artificial neural net architecture and hyperparameter tuning in general. That is not energy cost that you pay per training run, but fixed cost overhead instead.

You raise a good point that when doing artificial training, the "environment" has to be provisioned as well (i.e. feeding audio/visual/text input in some way to do the training), but here I would argue that in energy terms, that is a rather small overhead (less than an order of magnitude) because our digital information storage/transmission capabilities are frankly insane compared to a human already (and reasonably efficient as well).

It’s like I’m talking to Chomsky again … :)
I understood SETI style meaning crowdsourced. Instead of mining bitcoin you mine LLMs. It's a nice idea I think. Not sure about technical details, bandwidth limitations, performance, etc.
Unfortunately, LLM training is not as computationally easy (embarrassingly parallel) as mining bitcoins.
If that were to be solved (if at all possible, and feasible / competitive) I can definitely see "LLM mining" becoming a historic milestone. Also much closer to the spirit of F@H in some sense, depending how you look at it. Would there be a financial incentive? And how would it be distributed? Could you receive a stake in the LLM proportional to the contribution you made? Would that be similar in some sense to purchasing stock in an AI company, or mining tokens for a crypto currency? Potentially a lot of opportunity here.
This would require a revolution in the algorithms used to train a neural net: currently LLM training is at best distributed amongst GPUs in racks in the same datacenter, and ideally nearby racks, and that's already a significant engineering challenge, because each step needs to work from the step before, and each step updates all of the weights, so it's hard to parallelise. You can do it a little bit, because you can e.g. do a little bit of training with part of the dataset on one part of the cluster, and another part elsewhere, but this doesn't scale linearly (i.e. you need more compute overall to get the model to converge to something useful), and you still need a lot of bandwidth between your nodes to synchronize the networks frequently.

All of this makes it very poorly suited to a collection of heterogeneous compute connected via the internet, which wants a large chunk of mostly independent tasks which have a high compute cost but relatively low bandwidth requirements.
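To put a rough number on the bandwidth problem, here is a naive estimate of how long it takes to ship one full set of 16-bit gradients over a link. Everything here is an assumption for illustration: the 70 billion parameter size, the 100 Mbit/s home uplink, and the 400 Gbit/s datacenter link. It also ignores all-reduce topology, compression, and latency, so treat it as an order-of-magnitude sketch only.

```python
# Naive time to transfer one full copy of the gradients over a single link.
# Ignores all-reduce algorithms, gradient compression, and latency.


def sync_seconds(num_params, bytes_per_param, link_bits_per_second):
    """Seconds needed to send num_params values of the given width over the link."""
    payload_bits = num_params * bytes_per_param * 8
    return payload_bits / link_bits_per_second


PARAMS = 70e9            # illustrative model size (assumption)
HOME_UPLINK = 100e6      # 100 Mbit/s, hypothetical home connection
DATACENTER_LINK = 400e9  # 400 Gbit/s, hypothetical datacenter interconnect

home = sync_seconds(PARAMS, 2, HOME_UPLINK)
datacenter = sync_seconds(PARAMS, 2, DATACENTER_LINK)

print(f"home: {home / 3600:.1f} h per sync, datacenter: {datacenter:.1f} s per sync")
```

Under these assumptions a single synchronization takes hours on the home link versus seconds in the datacenter, and training needs a very large number of such synchronizations.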

The models are too large to fit on a desktop GPU's VRAM. Progress would either require smaller models (MoE might help here? not sure) or bigger VRAM. For example training a 70 billion parameter model would require at least 140GB of VRAM in each system, whereas a large desktop GPU (4090) has only 24GB.
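The 140GB figure is just 2 bytes per parameter for 16-bit weights. A small sketch of that arithmetic follows; the 16 bytes per parameter used for the "with optimizer state" case is a commonly cited rule of thumb for mixed-precision Adam (weights, gradients, and fp32 optimizer state) and is my assumption, not something stated in the comment.

```python
import math

# Rough memory arithmetic for holding a model on GPUs.
# Activations and framework overhead are ignored.


def model_gb(num_params, bytes_per_param):
    """Memory in GB (1e9 bytes) for num_params values of the given width."""
    return num_params * bytes_per_param / 1e9


def gpus_needed(total_gb, gpu_gb=24):
    """Minimum number of GPUs of size gpu_gb needed to hold total_gb."""
    return math.ceil(total_gb / gpu_gb)


PARAMS = 70e9

weights_only = model_gb(PARAMS, 2)     # 16-bit weights, as in the comment
with_optimizer = model_gb(PARAMS, 16)  # assumed rule of thumb for mixed-precision Adam

print(f"weights only: {weights_only:.0f} GB -> {gpus_needed(weights_only)} x 24GB GPUs")
print(f"with optimizer state: {with_optimizer:.0f} GB -> {gpus_needed(with_optimizer)} x 24GB GPUs")
```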

You need enough memory to run the unquantized model for training, then stream the training data through - that part is what is done in parallel, farming out different bits of training data to each machine.
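A toy sketch of what that looks like, assuming the simplest possible setup: a one-parameter linear model, two equally sized data shards, and plain gradient descent. Every "worker" holds the full model (here just `w`), computes a gradient on its own shard, and the gradients are averaged before each update. The averaging is the step that needs communication on every iteration. All names and numbers are hypothetical.

```python
# Toy data-parallel training: each worker has the full model and its own data shard.


def shard_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)**2 over one shard of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: per-shard gradients, then average, then update."""
    grads = [shard_gradient(w, s) for s in shards]  # would run on separate machines
    avg_grad = sum(grads) / len(grads)              # the synchronization point
    return w - lr * avg_grad


# Data generated from y = 3 * x, split into two equal shards.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.05)

print(f"learned w = {w:.4f}")
```

With equal shard sizes the averaged gradient equals the full-batch gradient, so this converges to the same `w` as single-machine training; the cost is one gradient exchange per step.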

Data parallel training is not the only approach. Sometimes the model itself needs to be distributed across multiple GPUs.

https://www.microsoft.com/en-us/research/blog/zero-deepspeed...

The communications overhead of doing this over the internet might be unworkable though.

Or if internet connections became significantly faster, e.g. with fiber.
damn it! but nice research area
SETI had a clear purpose that donors of computer resources could get behind. The LLM corps early on decided to drink the steering poison that will keep there from ever being a united community for making open LLMs. At best you'll get a fractured world of different projects, each with its own steering directives.
The internet is for ____.

That could be a factor that unites enough people to donate their compute time to build diffusion models. At least if it was easy enough to set up.

Related: people donating computing power to run diffusion and text models, which is definitely largely used for porn.

https://stablehorde.net/

Or the large amounts of community efforts (not exactly crowd sourced though) for diffusion fine-tunes and tools! Pony XL, and other uncensored models, for example. I haven't kept up with the rest, because there's just too much.

You don't have to donate, we will pay you for idle time of your gaming GPU: https://borg.games/setup
Asking to share the training data is a bit too much, it's petabytes of data, probably has privacy implications.

You can study and reproduce with your own training data right?

Probably legal ones too, such as evidence of copyright infringement.
Doesn’t Deepseek somewhat counter this narrative?
Don't they have something like 10k plus current gen GPUs?