by xonix
1399 days ago
This sounds reasonable. When the GoAWK creator tried to add Unicode support via UTF-8, he discovered that a naive implementation had drastic performance implications (turning some O(N) algorithms into O(N^2)): https://github.com/benhoyt/goawk/issues/35. The change was therefore reverted until a more efficient implementation can be found.
It's similar with substr() and other string functions, which are O(1) when operating on bytes but become O(N) when they have to count UTF-8 code points.
GNU Gawk has a fancier approach, which stores strings as UTF-8 as long as it can, but converts to UTF-32 if it needs to (e.g. the string is non-ASCII and you call substr()).
It looks like Brian Kernighan's code has the same issue with length() and substr(). I'm going to try to email him about this, as I think it's kind of a performance blocker.
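
To make the cost difference concrete, here is a rough Go sketch of the two substr() strategies. The function names (substrBytes, substrRunes, clamp) are made up for illustration and are not taken from GoAWK, Gawk, or Kernighan's awk; the only point is that the code-point version has to scan from the start of the string on every call.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// clamp restricts v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// substrBytes is a hypothetical byte-indexed substr (1-based start, as in AWK).
// It only computes two offsets and slices, so it is O(1).
func substrBytes(s string, start, length int) string {
	lo := clamp(start-1, 0, len(s))
	hi := clamp(start-1+length, lo, len(s))
	return s[lo:hi]
}

// substrRunes is a hypothetical code-point-indexed substr (1-based start).
// It must decode the string from the beginning to find the byte offsets,
// so it is O(N) in the length of s.
func substrRunes(s string, start, length int) string {
	first := start - 1 // zero-based index of the first code point wanted
	if first < 0 {
		first = 0
	}
	end := first + length // zero-based index one past the last code point
	if end < first {
		end = first
	}
	lo, hi := len(s), len(s)
	n := 0 // code points seen so far
	for i := range s { // i is the byte offset of each code point
		if n == first {
			lo = i
		}
		if n == end {
			hi = i
			break
		}
		n++
	}
	return s[lo:hi]
}

func main() {
	s := "héllo" // 'é' takes two bytes in UTF-8
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5
	fmt.Println(substrBytes(s, 2, 3))              // cuts by bytes: "él"
	fmt.Println(substrRunes(s, 2, 3))              // cuts by code points: "éll"
}
```

A loop that calls something like substrRunes(s, i, 1) for every i from 1 to N then does O(N) work N times, which is where the O(N^2) comes from.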