| HN Mirror



	by fykem 849 days ago
	More processing power does not make a model better. You can train a model on CPUs and get the same result, given the same model architecture and dataset. It'll just take longer to get there. What makes a model "good" is whether the dataset "fits" the model architecture properly and whether you have given it enough time (epochs) to reach a reasonably accurate prediction ratio (let's say 90% accurate). For image classification models, I've found that around 100 epochs for 10,000 items seems to be the best certain datasets will ever get. At some point the model is either underfitting or overfitting, and no amount of continued training or processing power will improve it.
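
One common way to detect the point where further training stops helping is early stopping on a validation loss. The sketch below is a minimal illustration, not the commenter's method; the function name `best_epoch`, the `patience` value, and the loss numbers are all made up for the example.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch_index, loss) of the best epoch seen before stopping.

    Training is treated as finished once the validation loss has failed
    to improve for `patience` consecutive epochs.
    """
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_idx, best_loss


# Made-up validation losses: they improve, then rise as the model overfits.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.38]
print(best_epoch(losses))
```

With `patience=3` the loop stops before it ever reaches the last value, which is the trade-off of this heuristic: a small patience saves compute but can miss a late improvement.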

2 comments

HeavyStorm 849 days ago

The OP asks "per request", not training time.


chank 849 days ago

The answer is still no, and for the same reason. Compute resources only affect how fast the model can answer, not the quality of the answer.


pixl97 849 days ago

Then why does chain of thought work better than asking for short answers?


p1esk 849 days ago

Because it’s a better prompt. Works better for people too.


famouswaffles 849 days ago

That's not the only reason.

More tokens = more useful compute towards making a prediction. A query with more tokens before the question is literally giving the LLM more "thinking time".
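
As a back-of-the-envelope illustration of the compute side of this claim, a common rule of thumb puts a transformer forward pass at roughly 2 FLOPs per parameter per token. The sketch below assumes that approximation and a hypothetical 7B-parameter model; it ignores the attention cost that grows with context length, and the token counts are made up. It shows only that more compute is spent, not that the answer gets better.

```python
def forward_flops(n_params, n_tokens):
    """Rough forward-pass compute: about 2 FLOPs per parameter per token.

    Ignores the attention cost that grows with context length, so this
    is closer to a lower bound than an exact figure.
    """
    return 2 * n_params * n_tokens


N_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

short_answer = forward_flops(N_PARAMS, 5)        # a terse 5-token reply
chain_of_thought = forward_flops(N_PARAMS, 200)  # a 200-token worked answer

# How many times more compute the longer generation uses.
ratio = chain_of_thought / short_answer
print(ratio)
```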


razodactyl 848 days ago

It correlates, but the intuition is a bit misleading. What's actually happening is that asking a model to generate more tokens increases the amount of information present in its context, and the model has learnt to use that information.

It's why "RAG" techniques work: the models learn during training to make use of information in context.

At the core of self-attention is a dot-product similarity measure, which causes the model to act somewhat like a search engine.

It's helpful to think about it in terms of search: the shape of the outputs looks like conversation, but we're actually prompting the model to surface information through its internal QKV matrices.

Does it feel familiar? When we brainstorm we usually chart graphs of related concepts e.g. blueberry -> pie -> apple.
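
To make the search analogy concrete, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The vectors are toy values chosen for illustration, and the function name `attention` is hypothetical; real models apply learned projection matrices to produce Q, K, and V and run many heads in parallel.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the softmax weights over the keys and the weighted
    average of the value vectors.
    """
    scale = math.sqrt(len(query))
    # Dot product of the query with each key: higher means "more relevant".
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the value vectors according to the weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy data: three keys with one value each; the query points along the first axis.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0], [20.0], [30.0]]
weights, output = attention([4.0, 0.0], keys, values)
print(weights, output)
```

The key most aligned with the query receives the largest weight, so its value dominates the output. Unlike a search engine that returns the top hit, the result is a soft blend of all values.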


p1esk 849 days ago

It’s not clear that more tokens are better.


frannyg 848 days ago

Ok, thanks. My misconception kind of kept me from seeing a potential (theoretical) assert statement, which is kind of what is meant by

> if the [resulting] dataset "fits" the model architecture properly,

right?

I have too many questions. It seems unreasonable to ask away and I should instead read the studies and some books.
