
by cschmidt 380 days ago

It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
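To make the scoring concrete, here is a minimal sketch of "frequency times length" ranking. It is not the sentencepiece code (which enumerates substrings with a suffix array); `seed_candidates`, `max_len`, `top_k`, and the `split_by_whitespace` parameter are hypothetical names for illustration, and the pre-token split is assumed to be a plain whitespace split.

```python
from collections import Counter


def seed_candidates(corpus, max_len=4, top_k=5, split_by_whitespace=True):
    """Rank character n-grams by frequency * length (illustrative helper only)."""
    counts = Counter()
    for sentence in corpus:
        # Assumption: pre-tokens are whitespace-separated words. When the
        # split is enabled, no counted n-gram can span a space.
        units = sentence.split() if split_by_whitespace else [sentence]
        for unit in units:
            for start in range(len(unit)):
                for n in range(2, max_len + 1):
                    if start + n <= len(unit):
                        counts[unit[start:start + n]] += 1
    # Score each n-gram by frequency times length.
    scored = {gram: freq * len(gram) for gram, freq in counts.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    print(seed_candidates(["ab ab ab", "abc"]))
```

Scoring by frequency alone would only ever favor short n-grams, since a longer n-gram can never occur more often than the substrings it contains; the length factor is what lets longer candidates compete.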

2 comments

mcyc 380 days ago

You can let pieces cross whitespace boundaries by setting the `--split_by_whitespace` flag to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...

cschmidt 379 days ago

For anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. Scoring is usually by frequency alone. Oops.
