|
|
|
|
|
by anonymoushn
387 days ago
|
|
Cool, I can have a look. By the way, I've filed a PR against the pathpiece repository correcting an error in the docs. I have read this paper before, and now that I have looked more closely I am a bit disappointed that the precise method of construction of the large initial vocabularies used for vocabulary learners that begin with large initial vocabularies in the experiment is not specified. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1. Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. This seems like not very many, and not weighting n-grams by "potential CTC reduction" (=length-1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also sort of wary of these greedy learners, because many candidates for inclusion in the vocab are interdependent with other candidates (the marginal loss or gain of a candidate may be near 0 because of the presence of some other token which may not actually be optimal to include, or there may be groups of included tokens which cause other better groups to have near 0 marginal loss or gain), so I would really want to avoid being too strongly path-dependent in the learner. If I may hand wave a great deal, I think this sort of issue also arises if your objective function is not CTC. |
|
With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.
Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.