| HN Mirror


	by painted-now 1063 days ago
	Can anyone recommend some paper or overview on how "sampling" / "decoding" is done in the e2e neural network age? I know how decoding was done for machine translation and speech recognition back in the HMM times (i.e. https://en.wikipedia.org/wiki/Viterbi_algorithm and https://en.wikipedia.org/wiki/Beam_search). These days I get the impression people just do "greedy" - but I don't really know. Any recommendations for info on that topic? Edit: Forgot Viterbi

2 comments

spion 1063 days ago

It's greedy and random :) Instead of a paper, I would recommend reading the sampling code of most LLM implementations (rwkv.cpp has a relatively clean implementation in Python: https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/samp...)
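
To make "greedy and random" concrete, here is a minimal sketch of one sampling step. It is not the rwkv.cpp code; the function names and the `temperature` and `top_p` parameters are assumptions chosen because they are common knobs in LLM samplers. A temperature of 0 falls back to greedy (argmax); otherwise a token is drawn at random from the smallest set of tokens whose cumulative probability reaches `top_p`.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token id from a list of logits (hypothetical helper)."""
    if temperature == 0.0:
        # Greedy: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    probs = softmax([x / temperature for x in logits])

    # Nucleus (top-p) filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p, and drop the rest.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Random draw among the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Calling `sample_token(logits, temperature=0.0)` gives plain greedy decoding, while the defaults sample from the full distribution.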

painted-now 1063 days ago

I guess I need to sit down and study this stuff in more detail, but do I understand correctly that the code you shared makes the decisions for each position independently? I am just astonished that this produces any coherent output. Also it is not clear to me how the length of the output sequence is determined.

pizza 1063 days ago

The output ends once the stop token is the likeliest one and gets picked.
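
A small sketch may help with both questions. The decisions are not independent: each new token is chosen from logits computed on everything generated so far, and the loop ends when the stop token is chosen (or a length cap is hit). The toy model, the `EOS` id, and the `max_new_tokens` cap below are all hypothetical stand-ins for illustration, not taken from any particular library.

```python
EOS = 3  # hypothetical end-of-sequence (stop) token id


def toy_next_token_logits(tokens):
    """Stand-in for a real model.

    A real model would condition on the whole prefix; this toy only looks
    at the last token and strongly prefers the next id modulo 4.
    """
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[(tokens[-1] + 1) % 4] = 5.0
    return logits


def generate(prompt, next_token_logits, max_new_tokens=16):
    """Greedy autoregressive decoding: one token at a time, fed back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The model sees the prompt plus every token chosen so far.
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == EOS:
            # The stop token was picked, so the output ends here.
            break
        tokens.append(next_id)
    return tokens
```

With this toy model, `generate([0], toy_next_token_logits)` appends tokens until the stop token becomes the likeliest choice; with a real model the same loop applies, with a sampling step in place of the argmax.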

janalsncm 1063 days ago

From reading through the GPT-4 documentation, it doesn’t seem like there’s a ton of difference from what you’ve mentioned.

https://platform.openai.com/docs/api-reference/completions/c...

GPT-4 is also widely reported (though not officially confirmed) to be a Mixture of Experts, so under the hood they’re presumably parallelizing computation across experts. The API also includes a way to modify the logits with presence/frequency penalty terms.
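
As a rough illustration of the penalty terms, here is a sketch that follows the formula as described in OpenAI's parameter documentation: each token's logit is reduced by the frequency penalty times the number of times it has already appeared, plus the presence penalty if it has appeared at all. The function name and argument layout are assumptions for this example, not part of any API.

```python
from collections import Counter


def apply_penalties(logits, generated_ids,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits with repetition penalties applied.

    Hypothetical helper: only tokens that already appear in generated_ids
    are touched; all other logits are left unchanged.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency term scales with the count; presence term is a flat cost.
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted
```

The adjusted logits would then go through the usual sampling step, so positive penalties make already-used tokens less likely without forbidding them outright.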
