| HN Mirror



	by jmclnx 167 days ago
	If I understood what I was reading about German Strings, I think UTF-8 could add complications to these things.

3 comments

StopDisinfo910 167 days ago

Not really, no.

The main difference is that you don't know how many code points are in the prefix. UTF-8 is a variable-length encoding, so four bytes can hold up to four code points or as few as one. I imagine four bytes was chosen for the prefix specifically for this reason: that's the maximum length of a single UTF-8 encoded code point.

The length is no longer the number of characters, just the size of the string in bytes.

Apart from that, it should work exactly the same.
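
To make the "up to four but as little as one" point concrete, here is a minimal sketch that counts how many code points begin inside the first four bytes of a string's UTF-8 encoding. The function name and the 4-byte default are illustrative assumptions, not anything taken from the actual implementation.

```python
def code_points_in_prefix(s: str, prefix_len: int = 4) -> int:
    """Count code points that start within the first prefix_len UTF-8 bytes of s.

    Hypothetical helper for illustration only. A code point that is cut
    off by the prefix boundary is still counted, since its lead byte is
    inside the prefix.
    """
    prefix = s.encode("utf-8")[:prefix_len]
    # Every byte except a continuation byte (0b10xxxxxx) starts a code point.
    return sum(1 for b in prefix if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    for sample in ["abcd", "héllo", "日本語", "😀x"]:
        print(repr(sample), code_points_in_prefix(sample))
```

Pure ASCII fills the prefix with four code points, while a single astral code point such as an emoji uses all four bytes by itself.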


msichert 167 days ago

We chose 4B because that was the maximum number of bytes that would be unused otherwise (4B for the length plus 8B for the pointer leaves 4B). The UTF-8 encoding doesn't really matter.
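
The arithmetic above can be sketched with Python's `struct` module, assuming a 16-byte header made of a 4-byte length, a 4-byte prefix, and an 8-byte pointer in that order. The field order, the little-endian byte order, and the names `LAYOUT`, `pack_header`, and `unpack_header` are assumptions for illustration, and the pointer is just a made-up integer.

```python
import struct

# Assumed layout: 4-byte length, 4-byte prefix, 8-byte pointer.
# "<" means little-endian with no padding, so the size is 4 + 4 + 8 bytes.
LAYOUT = struct.Struct("<I4sQ")


def pack_header(s: str, fake_ptr: int) -> bytes:
    """Build a header for s; fake_ptr stands in for a real address."""
    data = s.encode("utf-8")
    # The prefix is simply the first four bytes, even if that cuts a
    # code point in half. Shorter inputs are padded with NUL bytes.
    return LAYOUT.pack(len(data), data[:4], fake_ptr)


def unpack_header(header: bytes) -> tuple:
    """Return (length_in_bytes, prefix_bytes, pointer)."""
    return LAYOUT.unpack(header)


if __name__ == "__main__":
    header = pack_header("日本語 strings", 0x1000)
    print(LAYOUT.size, unpack_header(header))
```

Note that the stored length is the byte length of the encoded string, not the number of characters.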

Also, for UTF-8 specifically, cutting code points in half is fine as long as all strings are valid UTF-8. The UTF-8 encoding is prefix-free, i.e., the encoding of one valid code point is never a prefix of the encoding of another, so for prefix matching we can usually just compare bytes.
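
A small sketch of what "just compare bytes" can look like, assuming both inputs are valid UTF-8. The function `starts_with` and the 4-byte fast path are hypothetical and only mirror the idea described here, not the real code.

```python
PREFIX_LEN = 4  # assumed prefix size in bytes


def starts_with(haystack: str, needle: str) -> bool:
    """Byte-wise prefix test on the UTF-8 encodings of two strings."""
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")
    # Fast path: compare only the stored prefixes. They may end in the
    # middle of a code point, which is harmless for a byte comparison.
    k = min(len(n), PREFIX_LEN)
    if h[:k] != n[:k]:
        return False
    # Slow path: the prefixes agree, so compare the remaining bytes.
    return h[: len(n)] == n


if __name__ == "__main__":
    # "本" and "月" share their first UTF-8 byte, so the 4-byte prefixes
    # of "日本語" and "日月" are identical and only the slow path
    # tells the two apart.
    print(starts_with("日本語", "日本"), starts_with("日本語", "日月"))
```

The result agrees with a code-point-level comparison such as `str.startswith` because a lead byte can never be mistaken for a continuation byte.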

It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all edge cases of the Unicode spec anyway.


StopDisinfo910 167 days ago

> We chose 4B because that was the maximum number of bytes that would be unused otherwise

I'm sure you did, but there is something funny about reading this while considering that you have robbed two bits from your pointer to represent class. Admittedly, that's the only thing I find questionable in your design.

If that's the case it's a happy accident because having a full code point here is quite nice.


msichert 167 days ago

That's a good point. We just use pointer tagging in many different places (e.g. for pointer swizzling in our buffer pages), so including a few bits of information in a pointer just seemed obvious.
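
For readers unfamiliar with pointer tagging, here is a rough sketch of packing a 2-bit class into a 64-bit pointer value. It assumes the tag lives in the top two bits. Which bits are actually used is not stated in this thread, and all names below are hypothetical.

```python
TAG_BITS = 2
TAG_SHIFT = 64 - TAG_BITS          # assumption: tag sits in the top two bits
PTR_MASK = (1 << TAG_SHIFT) - 1    # low 62 bits hold the address


def tag_pointer(ptr: int, cls: int) -> int:
    """Combine an address and a 2-bit class into one 64-bit integer."""
    if not 0 <= cls < (1 << TAG_BITS):
        raise ValueError("class must fit in two bits")
    if not 0 <= ptr <= PTR_MASK:
        raise ValueError("address must fit in the low 62 bits")
    return (cls << TAG_SHIFT) | ptr


def untag_pointer(tagged: int) -> tuple:
    """Split a tagged value back into (address, class)."""
    return tagged & PTR_MASK, tagged >> TAG_SHIFT


if __name__ == "__main__":
    tagged = tag_pointer(0x7FFF_DEAD_BEE0, 2)
    print(hex(tagged), untag_pointer(tagged))
```

The pointer still occupies exactly 8 bytes, so the tag costs no extra space in the string header.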


masklinn 167 days ago

They just store bytes. A leading astral codepoint means your prefix store contains just one codepoint, but that doesn't really change anything per se.


f1shy 167 days ago

I think that, statistically, those short strings still mostly fall within the lower 128 code points, which is ASCII.
