| HN Mirror

	by visarga 1084 days ago
	The inspiration for weight decay was to reduce the model's capacity to memorize until it matches the complexity of the task, no more and no less. A model more complex than the task over-fits; one less complex under-fits. You have to balance the two. But the best cure for over-fitting is to make the dataset larger and ensure data diversity. LLMs have datasets so large that they usually train for only one epoch.
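
To make the weight decay idea above concrete, here is a minimal sketch in plain Python, assuming the common L2-style formulation in which each weight is pulled toward zero in proportion to its size. The function name `sgd_step` and the parameters `lr` and `weight_decay` are illustrative assumptions, not taken from any particular library.

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.0):
    """One plain SGD step with L2-style weight decay (illustrative sketch).

    The decay term adds weight_decay * w to each gradient, so large weights
    are shrunk toward zero unless the data gradient pushes back.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# With zero data gradient, the only effect left is the shrinkage itself.
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]
decayed = sgd_step(weights, zero_grads, lr=0.1, weight_decay=0.5)
print(decayed)
```

With `weight_decay=0.0` the step reduces to ordinary SGD; raising it trades fit to the training data for smaller weights, which is the capacity knob the comment describes.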

3 comments

nightski 1084 days ago

It sounds nice in theory, but the data itself could be problematic. There is no temporal nature to it. You can have duplicate data points, or many data points that are closely related and describe the same thing/event/etc. So while showing the model each data point only once ensures you do not put extra weight on any one data point, it doesn't help you at all if the dataset itself is skewed.
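
As a rough illustration of the duplicate problem, here is a minimal sketch that drops exact duplicates after light normalization. The helper names `normalize` and `deduplicate` are hypothetical, and this only catches copies that differ in case or whitespace; the closely related but differently worded data points mentioned above would need fuzzier matching that this sketch does not attempt.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, judged by normalized hash."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


corpus = ["The cat sat.", "the  cat sat.", "A dog ran."]
print(deduplicate(corpus))
```

Even with such a filter, a single pass over the data still weights every surviving point equally, so a skewed dataset stays skewed.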

Just by trying to make the dataset diverse you could skew things to not reflect reality. I just don't think enough attention has been paid to the data, and too much to the model. But I could be very wrong.

There is a natural temporality to the data humans receive. You can't relive the same moment twice. That said, human intelligence is on a scale too and may be affected in the same way.

visarga 1084 days ago

> I just don't think enough attention has been paid to the data, and too much to the model.

I wholly agree. Everyone is blinded by models - GPT4 this, LLaMA2 that - but the real source of the smarts is in the dataset. Why would any model, no matter how its architecture is tweaked, learn about the same ability from the same data? Why would humans all be able to learn the same skills when every brain is quite different? It was the data, not the model.

And since we are exhausting all the available quality text online we need to start engineering new data with LLMs and validation systems. AIs need to introspect more into their training sets, not just train to reproduce them, but analyse, summarise and comment on them. We reflect on our information, AIs should do more reflection before learning.

More fundamentally, how are AIs going to evolve past human level unless they make their own data or they collect data from external systems?

ben_w 1084 days ago

> It was the data, not the model

It's both.

It's clearly impossible to learn how to translate Linear A into modern English using only content written in pure Japanese that never references either.

Yet also, none of the algorithms before Transformers were able to first ingest the web, then answer a random natural language question in any domain — closest was Google etc. matching on indexed keywords.

> how are AIs going to evolve past human level unless they make their own data?

Who says they can't make their own data?

Both a priori (by development of "new" mathematical and logical tautological deductions), and a posteriori by devising, and observing the results of, various experiments.

Same as us, really.

riversflow 1084 days ago

I see this brought up consistently on the topic of AI take-off/X-risk.

How does an AI language model devise an experiment and observe the results? The language model is only trained on what’s already known, I’m extremely incredulous that this language model technique can actually reason a genuinely novel hypothesis.

An LLM is a series of weights sitting in the RAM of a GPU cluster; it’s really just a fancy prediction function. It doesn’t have the sort of biological imperatives (a result of being complete independent beings) or entropy that drive living systems.

Moreover, if we consider how it works for humans, people have to _think_ about problems. Do we even have a model or even an idea about what “thinking” is? Meanwhile, science is a looping process that mostly requires a physical element (testing/verification). So unless we make some radical breakthroughs in general purpose robotics, as well as overcome the thinking problem, I don’t see how AI can do some sort of tech breakout/runaway.

ben_w 1084 days ago

Starting with the end so we're on the same page about framing the situation:

> I don’t see how AI can do some sort of tech breakout/runaway.

I'm expecting (in the mode, but with a wide and shallow distribution) a roughly 10x increase in GDP growth, from increased automation etc., not a singularity/foom.

I think the main danger is bugs and misuse (both malicious and short-sighted).

-

> How does an AI language model devise an experiment and observe the results?

Same way as Helen Keller.

Same way scientists with normal senses do for data outside human sense organs, be that the LHC or nm/s^2 acceleration of binary stars or gravity waves (or the confusingly similarly named but very different gravitational waves).

> The language model is only trained on what’s already known, I’m extremely incredulous that this language model technique can actually reason a genuinely novel hypothesis.

Were you, or any other human, trained on things unknown?

If so, how?

> An LLM is a series of weights sitting in the RAM of a GPU cluster; it’s really just a fancy prediction function. It doesn’t have the sort of biological imperatives (a result of being complete independent beings) or entropy that drive living systems.

Why do you believe that biological imperatives are in any way important?

I can't see how the desire to eat, shag, fight, run away, or freeze up helps with either the scientific method or pure maths.

Even the "special sauce" that humans have over other animals didn't lead to any of us doing the scientific method until very recently, and most of us still don't.

> Do we even have a model or even an idea about what “thinking” is?

AFAIK, only in terms of output, not qualia or anything like that.

Does it matter if the thing a submarine does is swimming, if it gets to the destination? LLMs, for all their mistakes and their… utterly inhuman minds and transhuman training experience… can do many things which would've been considered "implausible" even in a sci-fi setting a decade ago.

> So unless we make some radical breakthroughs in general purpose robotics

I don't think it needs to be general, as labs are increasingly automated even without general robotics.

kaba0 1084 days ago

> Do we even have a model or even an idea about what “thinking” is

At the least, it is a computable function (as we don’t know of any physical system that would be more general than that, though some religions might disagree). That already puts human brains ahead of LLM systems, as we are Turing-complete while LLMs are not, at least in their naive application (their output can be fed into subsequent invocations, and that way they can be).

swid 1083 days ago

I googled whether or not universal function approximators, which neural nets are considered to be, are also considered Turing complete. It seems the general consensus is kind of not, since they are continuous and can’t do discrete operations in the same way.

But also, that isn’t quite the whole story, since they can be arbitrarily precise in their approximation. Here[0] is a white paper addressing this issue which concludes attention networks are Turing complete.

0: https://jmlr.org/papers/volume22/20-302/20-302.pdf

ben_w 1084 days ago

Is it provably not Turing complete? That property pops up everywhere even when not intended, like Magic: The Gathering card interactions.

Technically you may not want to call it Turing complete given the limited context window, but I'd say that's like insisting a Commodore 64 isn't Turing complete for the same reason.

Likewise the default settings may be a bit too random to be a Turing machine, but that criticism would also apply to a human.

imtringued 1084 days ago

It's not just a series of weights. It is an unchanging series of weights. This isn't necessarily artificial intelligence. It is the intelligence of the dead.

whimsicalism 1084 days ago

> Yet also, none of the algorithms before Transformers were able to first ingest the web, then answer a random natural language question in any domain — closest was Google etc. matching on indexed keywords.

Wrong, recurrent models were able to do this, just not as well.

Salgat 1084 days ago

This is definitely current models' biggest issue. You're training a model against millions of books worth of data (which would take a human tens of thousands of lifetimes) to achieve a superficial level of conversational ability to match a human, which can consume at most 3 novels a day without compromising comprehension. Current models are terribly inefficient when it comes to learning from data.

famouswaffles 1084 days ago

Modern LLMs are nowhere near the scale of the human brain however you want to slice things, so "terribly inefficient" is very arguable. Also, language skills seemingly take much less data and scale when you aren't trying to have the model learn the sum total of human knowledge. https://arxiv.org/abs/2305.07759

Salgat 1084 days ago

Scale is a very subjective thing since one is analog (86B neurons) and one is digital (175B parameters). Additionally, consider how many compute hours GPT 3 took to train (10,000 V100s were set aside for exclusive training of GPT 3). I'd say that GPT 3 scale vastly dwarfs the human brain, which runs at a paltry 12 watts.

kaba0 1084 days ago

Von Neumann’s book The Computer and the Brain is way out of date in terms of today’s hardware, but funnily it is still relevant on this metric. Biological systems are more analogous to a distributed system of small, very slow CPUs. Even GPUs, which somewhat close the gap between a few crazy fast CPUs and the aforementioned many slow ones, are still much faster than any one neuron in calculations, but are still overly serial. It is not the number of CPUs but the number of their connections that makes biological systems so powerful.

whimsicalism 1084 days ago

You have to count the training process from the origin of the human brain imo, not from the birth of any individual human.

Neural nets look much more competitive by that standard.

Salgat 1084 days ago

Yet humans designed the models, so the training process for ChatGPT etc. includes human evolution by your logic.

whimsicalism 1084 days ago

This is a good point and the level of so-called task specific "inductive bias" in models is an active point of discussion, but I don't think it is fair to add all of our evolution to the model inductive bias because most of evolution was not towards giving better understanding of language to the model, it was towards better understanding of language in humans.

imtringued 1084 days ago

They are inefficient by design. Gradient descent and backpropagation scale poorly, but they work and GPUs are cheap, so here we are.

crdrost 1084 days ago

And there have been a lot of approaches to do this, my favorite one being the idea that if we just randomly zap out some of the neurons while we train the rest, forcing the network to acquire that redundancy might privilege structured representations over memorization. It just always seemed like some fraternity prank: “if you REALLY know the tenets of Delta Mu Beta you can recite them when drunk after we spin you around in a circle twelve times fast!”

two_in_one 1084 days ago

> just randomly zap out some of the neurons while we train the rest

It's already done: https://pytorch.org/docs/stable/generated/torch.nn.functiona...
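
For readers who want to see the "zap out some of the neurons" idea without a framework, here is a minimal sketch of dropout in plain Python, assuming the common "inverted" variant that rescales the surviving activations during training. The function name `dropout` and its parameters `p`, `training`, and `rng` are illustrative choices for this sketch and are not a reproduction of the linked PyTorch function.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout sketch: zero each activation with probability p.

    Survivors are scaled by 1 / (1 - p) so the expected value is unchanged,
    which lets inference simply skip the masking step.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]


hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, p=0.5, rng=random.Random(0)))  # some entries zeroed
print(dropout(hidden, training=False))               # unchanged at inference
```

Because a different random subset is silenced on every training step, no single unit can be relied on, which is the forced redundancy described in the parent comment.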

whimsicalism 1084 days ago

https://nitter.net/Yampeleg/status/1688441683946377216

kaibee 1084 days ago

> But the best cure for over-fitting is to make the dataset larger and ensure data diversity.

This is also good life advice.