by cschmidt
380 days ago
It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods. See line 233:
https://github.com/google/sentencepiece/blob/master/src/unig...
I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
https://github.com/google/sentencepiece/blob/master/doc/opti...
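
For anyone who wants to see what "frequency times length" scoring looks like in practice, here is a rough Python sketch. It is not the sentencepiece code: the function name `seed_candidates`, the `max_len` and `top_k` parameters, and the whitespace split standing in for pre-tokenization are all my own assumptions, and it brute-forces substring counts rather than doing anything efficient. It also bakes in the guess that counts don't cross pre-token boundaries.

```python
from collections import Counter


def seed_candidates(corpus, max_len=4, top_k=5):
    """Rank substrings by frequency * length (hypothetical helper, not sentencepiece's API)."""
    counts = Counter()
    for line in corpus:
        # Assumption: whitespace splitting stands in for pre-tokenization,
        # so no counted substring ever spans two pre-tokens.
        for word in line.split():
            for i in range(len(word)):
                # Substrings starting at i with length 1..max_len.
                for j in range(i + 1, min(i + max_len, len(word)) + 1):
                    counts[word[i:j]] += 1
    # Score each substring by the product of its frequency and its length.
    scored = {s: freq * len(s) for s, freq in counts.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    for piece, score in seed_candidates(["low lower lowest", "new newer newest"]):
        print(piece, score)
```

The length factor is what pushes longer repeated pieces ahead of single characters, which would otherwise dominate on raw frequency alone.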