by dnnssl2
936 days ago
How does one select a good candidate for the draft model in speculative decoding? I imagine that there's some better intuition than just selecting the next parameter count down (e.g. 70B -> 13B, 13B -> 7B). Also, how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?
Anecdotally, folks often seem to use, say, a 70B base model with a 7B model as the draft (the 70B then acts as the verifier). But I think there's a lot of room for experimentation and improvement here.
You could, say, take a 70B model, chop off the last 90% of its layers, and then fine-tune. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.
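
To make the "n-gram" option concrete, here's a rough sketch of greedy speculative decoding with a bigram draft. Everything in it is made up for illustration: `BigramDraft`, `toy_target`, and `speculative_decode` are hypothetical names, and `toy_target` is just a deterministic stand-in for the big model's greedy next-token choice. It also calls the target once per token for readability, whereas a real system would score all k drafted positions in one batched forward pass (that's where the speedup comes from), and with sampling you'd use a rejection-sampling acceptance rule instead of exact match.

```python
from collections import Counter, defaultdict


class BigramDraft:
    """Hypothetical n-gram (bigram) draft: guesses the most frequent follower."""

    def __init__(self, corpus):
        self.table = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            self.table[prev][nxt] += 1

    def propose(self, context, k):
        """Return up to k guessed tokens continuing `context`."""
        out, last = [], context[-1]
        for _ in range(k):
            followers = self.table.get(last)
            if not followers:
                break  # unseen token: stop guessing early
            last = followers.most_common(1)[0][0]
            out.append(last)
        return out


def toy_target(context):
    """Stand-in for the large model's greedy next token (assumption: deterministic)."""
    pattern = ["the", "cat", "sat", "on", "the", "mat", "."]
    return pattern[len(context) % len(pattern)]


def greedy_decode(target, prompt, n_new):
    """Plain greedy decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding: keep draft guesses while the target agrees."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(seq) < goal:
        guesses = draft.propose(seq, k)
        proposed += len(guesses)
        for g in guesses:
            t = target(seq)  # the token the target itself would emit here
            seq.append(t)    # on a mismatch this is the target's correction
            hit = (t == g)
            accepted += hit
            if not hit or len(seq) >= goal:
                break
        else:
            # Every guess matched (or there were none): the target adds one more token.
            seq.append(target(seq))
    return seq, accepted, proposed


if __name__ == "__main__":
    corpus = "the cat sat on the mat . the cat sat on the rug .".split()
    draft = BigramDraft(corpus)
    out, accepted, proposed = speculative_decode(toy_target, draft, ["the"], 12)
    print(" ".join(out))
    print(f"accepted {accepted} of {proposed} drafted tokens")
```

Since every token that gets appended is one the target would have produced anyway, the output matches plain greedy decoding exactly. The draft only changes how many target steps could be batched together, so the acceptance rate is the thing to measure when comparing draft candidates.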