by alganet 1400 days ago
It needs the length for operations such as substring, or to apply length modifiers on regular expressions (such as \w{3,5}), which is a common thing in awk programs.

In fact, u8_rune, as implemented in the branch we are discussing (https://github.com/onetrueawk/awk/compare/unicode-support), returns a length that is used as an offset later.
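
To make the "returns a length to be used as an offset" idea concrete, here is a rough sketch of a decoder of that shape. This is not the code from the branch: decode_rune is a hypothetical name, and the error handling (one byte consumed, U+FFFD returned) is my assumption. It also skips overlong and surrogate checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the u8_rune from the awk branch.
 * Decodes one code point starting at s, stores it in *rune, and
 * returns the number of bytes consumed (1 to 4) so the caller can
 * advance its offset. Malformed input consumes one byte and yields
 * U+FFFD (an assumption made for this sketch). */
int decode_rune(const char *s, uint32_t *rune)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t r;
    int len, i;

    if (p[0] < 0x80) {                    /* 0xxxxxxx: ASCII */
        *rune = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        len = 2; r = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        len = 3; r = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        len = 4; r = p[0] & 0x07;
    } else {                              /* stray continuation or invalid lead */
        *rune = 0xFFFD;
        return 1;
    }

    /* Each continuation byte must look like 10xxxxxx. A NUL terminator
     * fails this test, so the loop never reads past the end of s. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) {
            *rune = 0xFFFD;
            return 1;
        }
        r = (r << 6) | (p[i] & 0x3F);
    }
    *rune = r;
    return len;
}
```

A caller walks the string with `off += decode_rune(s + off, &r)`, which is why the returned length matters more than the decoded value for things like substring offsets.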

This is not me saying it; it's the author. There is a code comment there:

> For most of Awk, utf-8 strings just "work", since they look like null-terminated sequences of 8-bit bytes. Functions like length(), index(), and substr() have to operate in units of utf-8 characters. The u8_* functions in run.c handle this.
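
As an illustration of what "operate in units of utf-8 characters" means in practice, here is a minimal sketch, assuming valid UTF-8 input. The names count_chars and char_to_byte_offset are hypothetical and are not the u8_* functions from run.c.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters in a UTF-8 string.
 * Counts every byte that is not a continuation byte (10xxxxxx),
 * so it assumes the input is valid UTF-8. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: byte offset of the character with the given
 * 0-based character index. Returns the byte length of s when the
 * index is past the end. */
size_t char_to_byte_offset(const char *s, size_t char_index)
{
    size_t off = 0;
    while (s[off] != '\0') {
        if (((unsigned char)s[off] & 0xC0) != 0x80) {  /* start of a character */
            if (char_index == 0)
                break;
            char_index--;
        }
        off++;
    }
    return off;
}
```

With these, a character-based substr reduces to two offset lookups: for "héllo" (6 bytes, 5 characters), the characters at indices 1 through 3 occupy the bytes from char_to_byte_offset(s, 1) up to, but not including, char_to_byte_offset(s, 4).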

I know there might be different ways of doing it, but we're talking about a specific implementation.

I was wrong to assume he was storing strings in UTF-32. He could have, but there was already code in place that made UTF-8 storage easier to implement.