

by adambatkin 1399 days ago

Take a step back and ask "why do I need to know the length of this thing?" If it's so you know how much storage to allocate, then the process/time is the same for both (the answer is: however many bytes you happen to have). If your array is bigger than the number of valid characters (code points) and you need to search through it to find the "end" (last valid character), you can do that with almost identical complexity (you don't actually need to iterate over every byte and Code Point with UTF-8 because of how elegantly the encoding was designed).
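One property this may be alluding to is that UTF-8 is self-synchronizing: continuation bytes always look like 10xxxxxx, so from any byte offset you can find the start of the enclosing code point by stepping back at most three bytes. A minimal sketch in C, assuming well-formed input; the name utf8_sync_back is hypothetical and not taken from any library:

```c
#include <stddef.h>

/* Returns 1 if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a byte offset pos into buf, return the offset
 * of the first byte of the code point that contains it. An out-of-range pos
 * is clamped to the last byte. Assumes buf holds well-formed UTF-8.
 */
size_t utf8_sync_back(const unsigned char *buf, size_t len, size_t pos)
{
    if (pos >= len)
        pos = len ? len - 1 : 0;
    /* Step back over continuation bytes until a lead or ASCII byte. */
    while (pos > 0 && is_continuation(buf[pos]))
        pos--;
    return pos;
}
```

Calling it with pos set to the last byte of a buffer gives the start of the last code point without scanning from the beginning.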

Why else might you need to know the length? If it's to know how much space to allocate in a GUI (or even on a console) then neither encoding is going to help.

Maybe it's because of some arbitrary limitation like "your name must be less than 50 characters" and I'll just say that if that's the case, you are doing it wrong (if you need to limit it for storage/efficiency purposes, fine, but you will probably be better off limiting by bytes and using UTF-8 since most people will be able to squeeze in more of their names).
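If you do limit by bytes, the one thing to watch is not cutting a multi-byte sequence in half. A rough sketch of how that could look, assuming well-formed UTF-8 (utf8_truncate_len is a made-up name for illustration):

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the largest length <= max_bytes at which s
 * (len bytes of well-formed UTF-8) can be cut without splitting a code point.
 */
size_t utf8_truncate_len(const unsigned char *s, size_t len, size_t max_bytes)
{
    size_t cut;

    if (len <= max_bytes)
        return len;              /* already fits, nothing to cut */

    cut = max_bytes;
    /*
     * s[cut] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut lands inside a sequence,
     * so move it back to the start of that sequence.
     */
    while (cut > 0 && (s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For the 5-byte UTF-8 encoding of "José" and a 4-byte limit, this returns 3, dropping the whole "é" rather than leaving a dangling lead byte.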

I'm not saying there aren't reasons for needing to know the "length" (number of Code Points) of a string, and certainly many existing algorithms are written in a way that assumes calculating string length and indexing arbitrarily into the middle of a string are fast (O(1) for indexing). But in reality, for almost any real-world problem beyond "how much storage do I need", almost everything you need to do requires iterating over the string one Code Point at a time. That is O(n) for both encodings, with the biggest difference being that UTF-8 may require more branching. UTF-8 is also common enough that, between vectorization and generally better optimizations because of its popularity, it will do just fine in many cases while usually using less storage, which can significantly benefit CPU cache locality.
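To make the O(n) point concrete, here is a minimal sketch of counting Code Points in UTF-8: every byte that is not a continuation byte starts a new Code Point. It assumes well-formed input, and the function name is made up for illustration:

```c
#include <stddef.h>

/*
 * Hypothetical helper: count code points in len bytes of well-formed UTF-8
 * by counting the bytes that are NOT continuation bytes (10xxxxxx).
 */
size_t utf8_count_code_points(const unsigned char *s, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The loop body is a single mask-and-compare per byte with no dependence on sequence length, which is the kind of loop a compiler may be able to vectorize.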

1 comment

alganet 1399 days ago

It needs the length for operations such as substring, or to apply length modifiers on regular expressions (such as \w{3,5}), which is a common thing in awk programs.

In fact, u8_rune as implemented in the branch we are discussing (https://github.com/onetrueawk/awk/compare/unicode-support) returns a length to be used as an offset later.

This is not me saying it, it's the author. There is a code comment there:

> For most of Awk, utf-8 strings just "work", since they look like null-terminated sequences of 8-bit bytes. Functions like length(), index(), and substr() have to operate in units of utf-8 characters. The u8_* functions in run.c handle this.
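To illustrate what "units of utf-8 characters" means for something like substr(), here is a rough sketch of turning a character index into a byte offset by reading each lead byte and skipping the sequence it announces. This is not the code from run.c; seq_len and utf8_char_to_byte are hypothetical names, and treating invalid lead bytes as length 1 is an assumption of this sketch:

```c
#include <stddef.h>

/*
 * Byte length of the UTF-8 sequence that starts with lead byte b.
 * Returns 1 for ASCII and (by assumption) also for invalid lead bytes,
 * so the caller always makes progress.
 */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                          /* stray continuation or invalid byte */
}

/*
 * Hypothetical helper: convert a 0-based character index into a byte offset
 * in the NUL-terminated string s. If the string has fewer characters, the
 * offset of the terminator is returned.
 */
size_t utf8_char_to_byte(const char *s, size_t char_index)
{
    size_t off = 0;

    while (char_index > 0 && s[off] != '\0') {
        size_t n = seq_len((unsigned char)s[off]);
        /* Advance over the sequence, never stepping past the terminator. */
        while (n > 0 && s[off] != '\0') {
            off++;
            n--;
        }
        char_index--;
    }
    return off;
}
```

With this, an awk-style substr(s, m, n) counted in characters could map to the byte range from utf8_char_to_byte(s, m - 1) up to utf8_char_to_byte(s, m - 1 + n).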

I know there might be different ways of doing it, but we're talking about a specific implementation.

I was wrong to assume he is storing stuff in UTF-32. He could have, but there was already code in place there to make the UTF-8 storage easier to implement.
