by jmclnx 167 days ago
If I understood what I was reading about German Strings, I think UTF-8 could add complications to these things.
3 comments

Not really, no.

The main difference is that you don't know how many code points the prefix holds. UTF-8 is a variable-length encoding, so four bytes can hold as many as four code points or as few as one. I imagine the choice of four bytes for the prefix was made specifically for this reason: four bytes is the maximum length of a UTF-8 encoded code point.

The length is no longer the number of characters, just the size of the string in bytes.

Apart from that, it should work exactly the same.
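
To make the byte-versus-code-point distinction concrete, here is a minimal Python sketch. The helper name `prefix_code_points` is made up for illustration and is not taken from any German Strings implementation; it only reports the byte length of a string and how many whole code points fit in its first four bytes.

```python
def prefix_code_points(s, prefix_len=4):
    """Return (byte length, whole code points in the first prefix_len bytes)."""
    data = s.encode("utf-8")
    prefix = data[:prefix_len]
    # errors="ignore" drops a code point that was cut off at the prefix boundary,
    # so only completely contained code points are counted.
    whole = len(prefix.decode("utf-8", errors="ignore"))
    return len(data), whole


for text in ["abcd", "ééé", "日本語", "😀!"]:
    print(repr(text), prefix_code_points(text))
```

Pure ASCII fills the prefix with four code points, while a leading emoji fills it with exactly one.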

We chose 4B because that was the maximum number of bytes that would otherwise be unused (of the 16B total, 4B for the length and 8B for the pointer leave 4B); the UTF8 encoding doesn't really matter.
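
The arithmetic above can be sketched with Python's `struct` module. The field order (length, prefix, pointer), the little-endian packing, and the names `LONG_LAYOUT`, `pack_long` and `unpack_long` are assumptions for illustration only; the thread states just the field sizes.

```python
import struct

# Assumed order for illustration: 4-byte length, 4-byte prefix, 8-byte pointer.
# "<" disables alignment padding, so the size is exactly 4 + 4 + 8 bytes.
LONG_LAYOUT = struct.Struct("<I4sQ")


def pack_long(data, pointer):
    """Pack a string header: byte length, first four bytes, pointer value."""
    # "4s" pads with NUL bytes if data is shorter than four bytes.
    return LONG_LAYOUT.pack(len(data), data[:4], pointer)


def unpack_long(raw):
    """Return (length, prefix, pointer) from a packed 16-byte header."""
    return LONG_LAYOUT.unpack(raw)


header = pack_long("Überraschung".encode("utf-8"), 0x1000)
print(LONG_LAYOUT.size, unpack_long(header))
```

Note that the stored length is 13 for the 12-character example string, because `Ü` takes two bytes in UTF-8.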

Also, for UTF8 specifically, cutting code points in half is fine as long as all strings are valid UTF8. The UTF8 encoding is prefix free, i.e., the encoding of no valid code point is a prefix of the encoding of another valid code point, so for prefix matching we can usually just compare bytes.
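
A small Python sketch of that claim, under the assumption that both inputs are valid UTF8. The function names are hypothetical. `byte_starts_with` does prefix matching on raw bytes, and `prefixes_may_match` mimics using a fixed 4-byte prefix as a cheap filter before a full comparison.

```python
def byte_starts_with(haystack, needle):
    """Prefix match on the UTF-8 bytes, ignoring code point boundaries."""
    return haystack.encode("utf-8").startswith(needle.encode("utf-8"))


def prefixes_may_match(a, b, n=4):
    """Compare only the first n bytes; False proves the strings differ."""
    return a.encode("utf-8")[:n] == b.encode("utf-8")[:n]


pairs = [("日本語", "日本"), ("日本語", "日木"), ("naïve", "naï"), ("naïve", "nai")]
for haystack, needle in pairs:
    # The byte-level answer agrees with the code-point-level answer.
    print(haystack, needle, byte_starts_with(haystack, needle),
          haystack.startswith(needle))
```

A `True` from `prefixes_may_match` is inconclusive: the prefix of `日本語` ends in the middle of `本`, so the rest of the string still has to be compared.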

It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all edge cases of the Unicode spec anyway.

> We chose 4B because that was the maximum number of bytes that would be unused otherwise

I'm sure you did, but it's funny to read this phrase while considering that you have robbed two bits from your pointer to represent the class, which is admittedly the only thing I find questionable in your design.

If that's the case it's a happy accident because having a full code point here is quite nice.

That's a good point. We just use pointer tagging in many different places (e.g. for pointer swizzling in our buffer pages), so including a few bits of information in a pointer just seemed obvious.
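
For readers unfamiliar with pointer tagging, a rough Python sketch using plain integers in place of real pointers. The thread does not say which two bits are used, so placing the tag in the top two bits of a 64-bit word is an assumption, as are the names `tag_pointer` and `untag_pointer`.

```python
TAG_BITS = 2                      # two bits for the class, as in the thread
TAG_SHIFT = 64 - TAG_BITS         # assumption: the tag sits in the top bits
ADDR_MASK = (1 << TAG_SHIFT) - 1  # low 62 bits hold the address


def tag_pointer(address, storage_class):
    """Combine an address and a 2-bit class into one 64-bit word."""
    if not 0 <= storage_class < (1 << TAG_BITS):
        raise ValueError("class does not fit in the tag bits")
    if not 0 <= address <= ADDR_MASK:
        raise ValueError("address would overlap the tag bits")
    return (storage_class << TAG_SHIFT) | address


def untag_pointer(word):
    """Split a tagged word back into (class, address)."""
    return word >> TAG_SHIFT, word & ADDR_MASK


word = tag_pointer(0x7FFFDEADBEE0, 2)
print(hex(word), untag_pointer(word))
```

The pointer must be masked before it is dereferenced, which is the cost the parent comment is hinting at.
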
They just store bytes. A leading astral codepoint means your prefix store contains just one codepoint, but that doesn't really change anything per se.
I think that, statistically, most of those short strings still fall within the lower 128 code points, which is ASCII.