| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by xonix 1399 days ago
	This sounds reasonable. When the GoAWK creator tried to add Unicode support through UTF-8 he discovered that this had drastic performance implications (rendering some algorithms to be O(N^2) instead of O(N)), if done naive https://github.com/benhoyt/goawk/issues/35. Therefore the change was reverted till the more efficient implementation can be found.

1 comments

benhoyt 1398 days ago

Yes, that's right. With my simplistic UTF-8-based implementation it turned length() -- for example -- from O(1) to O(N), turning O(N) algorithms which use length() into O(N^2). See this issue: https://github.com/benhoyt/goawk/issues/93

Similar with substr() and other string functions, which when operating as bytes are O(1), but become O(N) when trying to count the number of codepoints as UTF-8.

GNU Gawk has a fancier approach, which stores strings as UTF-8 as long as it can, but converts to UTF-32 if it needs to (eg: the string is non-ASCII and you call substr).

It looks like Brian Kernighan's code has the same issue with length() and substr(). I'm going to try to email him about this, as I think it's kind of a performance blocker.

benhoyt 1398 days ago

Brian sent me a lovely (and surprisingly lengthy) reply. He noted that length() was already O(N), as onetrueawk just represents strings as NUL-terminated C strings and uses strlen() (though strlen is a lot faster than decoding UTF-8). However, he said that substr() and index() have indeed gone from O(1) to O(N), as it used to be possible to just index by number of bytes.

The most challenging part, he said, was updating the regex engine, which "was written by Al Aho in 1977 and hasn't really been touched much since".