by painted-now 1063 days ago
Can anyone recommend some paper or overview on how "sampling" / "decoding" is done in the e2e neural network age? I know how decoding was done for machine translation and speech recognition back in the HMM times (i.e. https://en.wikipedia.org/wiki/Viterbi_algorithm and https://en.wikipedia.org/wiki/Beam_search). These days I get the impression people just do "greedy" - but I don't really know. Any recommendations for info on that topic?

Edit: Forgot Viterbi

2 comments

It's greedy and random :) Instead of a paper, I would recommend reading the sampling code in most LLM implementations (rwkv.cpp has a relatively clean implementation in Python: https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/samp...)
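For anyone who wants the "greedy and random" part spelled out, here is a minimal sketch of one decoding step, assuming the model hands back a plain list of logits. It is not a copy of the linked file; the function name `sample_token` and its parameters are made up for illustration, and treating temperature 0 as greedy is a convention chosen here.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token id from a list of raw scores (logits)."""
    # Assumption of this sketch: temperature 0 means greedy decoding (argmax).
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens in proportion to their probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Lower temperature or smaller top_p pushes this toward greedy; temperature 1 with top_p 1 samples from the model's distribution as-is.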
I guess I need to sit down and study this stuff in more detail, but do I understand correctly that the code you shared makes the decisions for each position independently? I am just astonished that this produces any coherent output. Also it is not clear to me how the length of the output sequence is determined.
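Once the stop token is picked. With greedy decoding that is the step where it becomes the likeliest token; with sampling it is whenever it gets drawn. Implementations usually also cap the output at a maximum number of tokens.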
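On the "independent decisions" question: the choice at each position is made one at a time, but every chosen token is appended to the context before the next step, so later choices are conditioned on earlier ones. A rough sketch of that loop, with a hypothetical toy function standing in for the model (the names `generate`, `toy_next_logits`, and `eos_id` are assumptions for illustration only):

```python
def generate(next_logits, prompt, eos_id, max_new_tokens=50, pick=None):
    """Autoregressive decoding: one token at a time, each conditioned on the prefix."""
    if pick is None:
        # Greedy choice: index of the largest score.
        pick = lambda logits: max(range(len(logits)), key=lambda i: logits[i])

    tokens = list(prompt)
    for _ in range(max_new_tokens):    # hard cap on output length
        logits = next_logits(tokens)   # the model sees everything chosen so far
        token = pick(logits)
        if token == eos_id:            # stop token chosen: output is complete
            break
        tokens.append(token)           # the choice is fed back in for the next step
    return tokens[len(prompt):]


# Hypothetical stand-in for a language model over the vocabulary {0, 1, 2, 3},
# where 3 plays the role of the stop token. It prefers "previous token + 1".
def toy_next_logits(tokens):
    target = min(tokens[-1] + 1, 3)
    return [1.0 if i == target else 0.0 for i in range(4)]


print(generate(toy_next_logits, prompt=[0], eos_id=3))  # [1, 2]
```

Passing a sampling function as `pick` instead of the greedy default gives the "random" variant; the loop and the stopping rule stay the same.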
Just reading through the GPT-4 documentation, it doesn’t seem like there’s a ton of difference from what you’ve mentioned.

https://platform.openai.com/docs/api-reference/completions/c...

Of course, GPT-4 is widely reported (though not officially confirmed) to be a Mixture of Experts, but that concerns how computation is routed under the hood, not how tokens are sampled. The API also includes a way to modify the logits with presence/frequency penalty terms.
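To make the penalty terms concrete, here is a small sketch following the formula as I understand it from the OpenAI docs: each token's logit is reduced by its count so far times the frequency penalty, plus the presence penalty once if it has appeared at all. The function name `apply_penalties` is hypothetical, and this is not OpenAI's actual code.

```python
from collections import Counter


def apply_penalties(logits, generated, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits with tokens that were already generated pushed down."""
    counts = Counter(generated)
    adjusted = list(logits)
    for token, count in counts.items():
        # Frequency term grows with each repetition; presence term is a flat,
        # one-time reduction for any token that has appeared at least once.
        adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted


print(apply_penalties([2.0, 2.0, 2.0], generated=[0, 0, 1],
                      presence_penalty=0.5, frequency_penalty=0.25))
# [1.0, 1.25, 2.0]
```

The adjusted logits then go through the usual temperature/top-p sampling step, so the penalties only shift probabilities rather than forbid tokens outright.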