by mvsin 807 days ago

Something like this does exist. Production systems rarely use greedy search and instead rely on more holistic search algorithms.

An example is beam search: https://www.width.ai/post/what-is-beam-search

Essentially, we keep a window of the most probable candidate sequences at each step, rather than only the single most likely token, to improve the final quality of the output.
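
To make the "window" idea concrete, here is a minimal sketch of beam search over a made-up toy model. `TOY_MODEL` and its probabilities are hypothetical, invented only for illustration; a real LLM would produce the next-token distribution instead. A beam width of 1 is the same as greedy search.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
# These numbers are invented for illustration, not taken from any real model.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.35, "z": 0.25},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(start, steps, beam_width):
    """Return the surviving (tokens, log_prob) beams, best first."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Extend every surviving sequence by every possible next token.
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the `beam_width` highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search("<s>", steps=2, beam_width=1)[0]
beam = beam_search("<s>", steps=2, beam_width=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(beam[0], round(math.exp(beam[1]), 2))
```

In this toy setup, greedy commits to "a" because it is the most likely first token and ends with a sequence probability of 0.24, while the width-2 beam keeps "b" alive and finds a sequence with probability 0.36.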

2 comments

user_7832 807 days ago

Thanks, that's exactly what I was looking for! Any idea if it's possible to use beam search on local models like Mistral? It sounds like the choice of beam search vs., say, top-p or top-k should be in the software and not embedded in the model, right?
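
For what it's worth, top-k and top-p are just post-processing steps applied to the next-token distribution that the model outputs, which is why they can live in the inference software. A rough sketch, assuming the distribution is given as a plain dict (the function names and the example probabilities here are hypothetical):

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Made-up next-token distribution, standing in for a model's output.
next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
top2 = top_k_filter(next_token_probs, k=2)
nucleus = top_p_filter(next_token_probs, p=0.9)
print(top2)
print(nucleus)
```

The same model output feeds both filters; only the code that picks the next token changes.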

activatedgeek 807 days ago

If you use HuggingFace models, then a few simpler decoding algorithms are already implemented in the `generate` method of all supported models.

Here is a blog post that describes it: https://huggingface.co/blog/how-to-generate

I will warn you, though, that beam search is typically what you do NOT want. Beam search approximately optimizes for the "most likely sequence at the token level." This is rarely what you need in practice with open-ended generation (e.g. a question-answering chat bot). In practice, you need the "most likely semantic sequence," which is a much harder problem.

Of course, various approximations for semantic alignment are currently in the literature, but this is still a wide-open problem.

yunohn 807 days ago

This is actually a great question for which I found an interesting attempt: https://andys.page/posts/llm_sampling_strategies/

(No affiliation)

qeternity 806 days ago

> production systems rarely use greedy search

I have no idea why you say this. Most of our pipelines will run greedy, for reproducibility.

Maybe we turn the temp up if we are returning conversational text back to a user.
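
A small sketch of the difference, assuming raw logits for three candidate tokens (the values and helper names are made up for illustration): greedy always picks the arg-max, so it is deterministic, while temperature rescales the logits before sampling.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Sample an index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens
greedy_runs = {greedy_pick(logits) for _ in range(100)}
sharp = softmax_with_temperature(logits, 0.5)  # low temp: closer to greedy
flat = softmax_with_temperature(logits, 2.0)   # high temp: more varied output
sampled = sample_pick(logits, 1.0, random.Random(0))
print(greedy_runs, sharp, flat, sampled)
```

Greedy returns the same token on every run, and raising the temperature moves probability mass away from the top token toward the alternatives.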
