

	by typpilol 237 days ago
	It will require like 20x the compute


ACCount37 237 days ago

A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".

If we had a million times the compute? We might have brute forced our way to AGI by now.


Jensson 237 days ago

But we don't have a million times the compute. We have the compute we have, so it's fair to argue that we want to prioritize other things.


Mehvix 237 days ago

Why do you suppose this is a compute-limited problem?


ACCount37 237 days ago

It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times as much compute for the new method as for the controls. And the gain was +4%.

A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.


typpilol 236 days ago

Thanks.

Also, saying it needs 20x compute is exactly that. It's something we could do eventually, but not now.


kenjackson 237 days ago

Why so much compute? Can you tie it to the problem?


typpilol 236 days ago

Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.

Removing the tokenizer would cut the effective context to 1/4 and 4x the compute and memory, assuming an avg token length of 4 characters.

Also, you would probably need to 4x the parameters, since the model has to learn the relationships between individual characters as well as between words, sentences, etc.

There have been a few studies on small models, and even those only show a tiny percentage gain over tokenized models.

So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.

And that fails when you use more than 1/4 of the context. So realistically you need to support the same context, so your compute goes up another 4x, to 16x.

That's why
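
For anyone who wants to check the arithmetic, here is a rough sketch of the multipliers described above. It only reproduces the back-of-envelope reasoning in this comment. The function name and the dictionary keys are made up for illustration, the "each character becomes one position" model is an assumption, and so is the choice to scale compute by the same factor a second time when restoring the original context. The 4x parameter guess is left out of the compute figure, since the comment lists it separately.

```python
def character_level_overhead(avg_token_len: float = 4.0) -> dict:
    """Back-of-envelope multipliers for dropping the tokenizer.

    Assumes every token of `avg_token_len` characters becomes that many
    sequence positions, and that compute scales by that same factor
    (a simplification taken from the comment, not a measured result).
    """
    # The same text now occupies avg_token_len times as many positions.
    sequence_factor = avg_token_len

    # A fixed-size window therefore holds only 1/avg_token_len as much text.
    effective_context = 1.0 / avg_token_len

    # Compute and memory for the same text, at the shrunken effective context.
    compute_fixed_window = sequence_factor

    # Growing the window to cover the original amount of text costs the
    # same factor again under this rough model ("another 4x, to 16x").
    compute_same_context = compute_fixed_window * avg_token_len

    return {
        "sequence_factor": sequence_factor,
        "effective_context": effective_context,
        "compute_fixed_window": compute_fixed_window,
        "compute_same_context": compute_same_context,
    }


overhead = character_level_overhead(avg_token_len=4.0)
for name, value in overhead.items():
    print(f"{name}: {value}")
```

Changing `avg_token_len` shows how sensitive the estimate is to that one assumption, which is the main number the whole argument hangs on.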


ashirviskas 234 days ago

This makes a ton of seemingly random assumptions. Why can't we compress multiple latent space representations into one? Even in simple tokenizers, the token "and" has no right being the same size as "scientist".


kenjackson 236 days ago

Thanks. That helps a lot.
