Hacker News new | ask | show | jobs
by rohanmehta1 849 days ago
> But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.

There's a caveat here - allowing the model to produce more tokens (i.e. giving it more compute time to "think") can produce better results. E.g. asking a model to reason before producing an answer, leads to better answers. And the extra tokens = more compute.

4 comments

True! It's important to first understand the fundamentals of what makes an LLM "good" and what makes it fast, but yes, there are lots of techniques you can apply right before and during the inference step that can trade off between speed and capabilities.

Different prompting techniques like what you're describing are one way, and RAG [0] and ART [1] are also in a similar category.

[0] https://stackoverflow.blog/2023/10/18/retrieval-augmented-ge...

[1] https://www.promptingguide.ai/techniques/art

And adding some more here. I don't know if any models doing this but there is the possibility of generating tokens that it does not show to the user. I think there's quite a lot of scope for internal monologue/chain of thought that could provide concise but clever answers. The difficulty in this is the latency while it ponders to itself, but having played with the groq demos. I think there's scope for a decent interactive experience.

The concern people might feel when they realise an ai might have private thoughts is another issue entirely.

That was indeed part of what I wondering about.

Larger and smaller, in my beginner mind, was a difference of much recursiveness the design of the model allowed.

- User request implies knowledge about X. - PULLING in weights for X. - Probability of user knowing about Xm and Xz is low (because the training data says Xm and Xz are PhD-level knowledge or something). - Pulling in weights for an ELI5-level explanation of Xm and Xz ...

I thought, an LLM would do this recursive pulling of weights based on the semantics of the user request, which it does, but it doesn't do that "dynamically" based on "recalculated" weights and regenerated combos of tokens, which could happen if the training data wasn't "frozen" and accessible, which I learned further down in the comments, isn't.

That's why I wondered whether more processing power and or time would benefit this recursive generation and pulling.

Yea, doing thing like Chain of Thought, and/or running a second query to examine the first set of tokens it generated commonly improve answers.