by imranq 933 days ago

LLMs are comprised of just three elements

Data

Compute

Algorithms

All three are just scratching the surface of what is possible.

Data: What has been scraped off the internet is just <0.001% of human knowledge. Most platforms cannot be scraped easily, and a lot of knowledge sits in non-text formats like video and audio, or on plain old undigitized paper. Finally, there are probably techniques to increase data through synthetic means, which is purportedly OpenAI's secret sauce for GPT-4's quality.

Compute: While 3nm processes are approaching an atomic limit (0.21nm for Si), there is still room to explore more densely packed transistors, other materials like gallium nitride, or optical computing. Not only that, but there is a lot of room in hardware architecture for more parallelism and 3-D stacked transistors.

Algorithms: The transformer and other attention mechanisms have several suboptimal aspects, such as how arbitrary many of the Transformer's design decisions are, and the quadratic time complexity of attention in sequence length. There also seems to be a large space of LLM augmentations, like RLHF for instruction following, improvements in factuality, and other mechanisms.
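A minimal sketch of the quadratic cost mentioned above, assuming plain single-head dot-product self-attention with no optimizations. The names `attention_weights` and `score_count` are illustrative only and do not come from any library. Every token is scored against every other token, so doubling the sequence length quadruples the number of scores.

```python
import math
import random

def attention_weights(queries, keys):
    """Softmax-normalized dot-product scores: one row per query, one column per key."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # One score per (query, key) pair: this inner loop is the quadratic part.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def score_count(n_tokens, d_model=4, seed=0):
    """Build random token vectors and count the attention scores computed."""
    rng = random.Random(seed)
    x = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
    w = attention_weights(x, x)  # self-attention: queries and keys are the same tokens
    return sum(len(row) for row in w)

for n in (8, 16, 32):
    print(n, score_count(n))  # the count equals n * n
```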

And these ideas are just from my own limited experience. So I think it's fair to say that LLMs have plenty of room to improve.

3 comments

discreteevent 933 days ago

> LLMs are comprised of just three elements

> Data

> Compute

> Algorithms

Not to be facetious but so is all other software. LLMs appear to scale in correlation to the first two but it's not clear what the correlation is and that's the basis of the question being asked.

joecool1029 933 days ago

For data though, as LLMs generate more output, wouldn't they be expected to mess themselves up over time with their own generated data?

Wouldn't that be the wall we'll hit? Think of how shitted up Google Search is with generated garbage. I'm imagining we're already past the 'golden age', when we were able to train on good datasets before they got 'polluted' with LLM-generated data that may not be accurate, and that it just continues to become less accurate over time.

gaganyaan 933 days ago

I don't really buy that line of argument. There's still useful signals, like upvotes, known human writing, or just plain spending time/money to label it yourself. There's also the option of training better algorithms on pre-LLM datasets. It's something to consider, but not any sort of crisis.

throwaway4aday 933 days ago

Cleaning and preparing the dataset is a huge part of training. Like the OP mentioned, OpenAI likely have some high quality automation for doing this, and that's what has given them a leg up on all other competitors. You can apply the same automation to clear out low quality AI content the same way you remove low quality human content. It's not about the source; only the quality matters.
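A toy sketch of source-agnostic filtering as described above. The scoring heuristic (share of distinct words, with a minimum length) and the names `quality_score` and `filter_corpus` are assumptions made up for illustration; nothing here reflects what OpenAI or anyone else actually runs. The point is only that the filter never looks at where a document came from.

```python
def quality_score(text: str) -> float:
    """Toy heuristic (an assumption): share of distinct words, 0.0 for very short texts."""
    words = text.lower().split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)

def filter_corpus(docs, threshold=0.5):
    """Keep documents by score alone; the 'source' field is never consulted."""
    return [d for d in docs if quality_score(d["text"]) >= threshold]

docs = [
    {"source": "human", "text": "buy buy buy buy buy buy now now now now"},
    {"source": "llm", "text": "Attention cost grows with the square of the sequence length."},
    {"source": "human", "text": "ok"},
]

kept = filter_corpus(docs)
for d in kept:
    print(d["source"], "->", d["text"])
```

A real pipeline would presumably swap the heuristic for a trained classifier or deduplication step, but the shape stays the same: score, threshold, keep.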

edmundsauto 933 days ago

There must be signals in the data about generated garbage, otherwise humans wouldn't be able to tell. Something like PageRank would be a game changer and potentially solve this issue.
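For reference, a minimal sketch of the PageRank idea mentioned above, using simple power iteration on a tiny hand-made link graph. The graph, the function name `pagerank`, and the parameter defaults are illustrative assumptions; how such a score would be applied to detecting generated text is left open here, as it is in the comment.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping each node to its outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share (the "random jump" term).
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank

# Hypothetical graph: three pages link to "c", and "c" links back to "a".
graph = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```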

quickthrower2 933 days ago

We need models that need less language data to train. Babies learn to talk on way less data than the entire internet. We need something closer to human experience. Kids have a feel for what is bullshit before they have consumed the entire internet :-).

I think feeding the internet into an LLM will be seen as the mainframe days of AI.

nikhil896 933 days ago

My counter-point to this is that babies are born with a sort of basic pre-trained LLM. Humans are born with the analogue of weights & biases in our brains already partly optimized to learn language, math, etc. Before pre-training, an LLM's weights & biases are initialized with random values. Training on the internet can IMO be seen as a kind of "pre-training".

Volundr 933 days ago

> Babies learn to talk on way less data than the entire internet.

Is this actually true? My gut check says yes, but I'm also unaware of any meaningful way to actually quantify the volume of sensor data processed by a baby (or anyone else for that matter), and it wouldn't shock me to discover that, if we could, we'd find it to be a huge volume.

quickthrower2 933 days ago

Ah yes. I should be more precise: less textual data. Of course other data sources are plentiful, including internal and external sensory data.

pyuser583 933 days ago

Babies in ancient societies certainly had less exposure to written language, a much smaller vocabulary, less exposure to music, etc.

Volundr 932 days ago

Sure, the breadth is (maybe) smaller, but the question is volume. Babies get years of people talking around them, as well as data from their own muscles and vocalizations fed back to them. Is the volume they have consumed by the point they begin talking actually less than the volume consumed by an LLM?

pyuser583 932 days ago

If you’re talking about babies in ancient societies (which I am), the answer is absolutely yes. They were exposed to much less language, and much less sound, than we are.

imtringued 933 days ago

I'm sorry, but you can have process nodes smaller than an atom. The size of atoms is irrelevant here. The process node refers to the dimensions a theoretical planar transistor would need in order to be equivalent to the current 3D transistors. If you stack multiple transistors on top of one another, the process node gets smaller regardless of what you think.
