Hacker News
by geofft 3581 days ago
> suggestions for speeding it up in the tracker is to just remove that and work on raw bytes (FFS)

This is valid, because UTF-8 was designed to make this valid. The UTF-8 encoding of a comma, 0x2C (also the ASCII encoding of a comma), does not appear as part of any other character's UTF-8 encoding: every byte of a multi-byte sequence has its high bit set, so it can never equal an ASCII byte. Same with the UTF-8 encoding of the double quote, 0x22. So scanning for 0x22 and 0x2C bytes, without stopping to decode other UTF-8 sequences along the way, will produce the correct result for a valid UTF-8 input string. Then you fully decode UTF-8 for the individual fields when needed (and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all).

4 comments

> and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all

Is Go's internal representation of the target string UTF-8?

> Is Go's internal representation of the target string UTF-8?

Kinda but kinda not: a Go string is actually an arbitrary bag of bytes, but some APIs (such as unicode/utf8, or `range` to iterate over codepoints, which are runes in Go parlance) assume it's proper UTF-8.

Go's implementation allows the caller to designate any UTF-8 character as the delimiter, not just an ASCII one.

For string comparisons, wouldn't case sensitivity and normalization be an issue in some contexts?

That's cool about UTF-8 - what downsides are there to not treating UTF-8 as raw bytes?

The big things are related to string length not matching byte count. strlen() is O(n) because you have to see how many sequences are actually in the string. More than that, splitting/slicing/indexing a string based on byte offsets doesn't work. For a 100-byte ASCII string, you're guaranteed that you can split it into two 50-byte strings and things will still work: you can output them separately, you can get the total length by adding strlen() on each half, you can find a character by doing strchr() on each half, etc. For a 100-byte valid UTF-8 string, splitting it into two 50-byte strings may give you an invalid string, because a character could be split in half. So strlen() (even a UTF-8-correct strlen()) and strchr() don't compose. Outputting a string in two halves works properly as long as the receiver buffers its input and is willing to wait to reconstruct a partial character.

A related problem is that in older UNIX terminals, pressing backspace would delete one byte, not one character. Newer UNIX kernels have code in the terminal implementation to decode UTF-8 enough to backspace an entire character.

To clarify, getting the length of a UTF-8 string in Go is O(1); it's computed and stored in the string header at creation.

To clarify even more: that length is the number of bytes (or UTF-8 code units) in the string. It doesn't correspond to the number of characters (which one may consider to be either Unicode codepoints or, more technically correct, Unicode grapheme clusters).

If you want to count the number of codepoints (called "runes" in Go) in a string, then you need to do so explicitly: https://golang.org/pkg/unicode/utf8/#RuneCountInString
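For example, a minimal sketch comparing the two counts. `counts` is an illustrative wrapper; `len` and `utf8.RuneCountInString` are the real calls.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (illustrative wrapper) returns the byte length and the
// codepoint count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s) // len is O(1); the rune count walks the string
}

func main() {
	samples := []string{
		"cafe",        // plain ASCII
		"caf\u00e9",   // precomposed U+00E9
		"cafe\u0301",  // "e" plus combining acute accent U+0301
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

The last sample displays as four characters but has five codepoints, which is the grapheme-cluster caveat: neither `len` nor the rune count gives you "characters as the user sees them".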

Touché