| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cschmidt 379 days ago
	Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.

1 comments

anonymoushn 379 days ago

No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.

link

cschmidt 378 days ago

That's in interesting point. While your correct, of course, it is so common to consider a hash table lookup a O(1) operation, it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.

link