by msichert
166 days ago
We chose 4 bytes because that was the maximum number of bytes that would otherwise be unused (4 bytes for the length plus 8 bytes for the pointer leaves 4 bytes); the UTF-8 encoding didn't really matter. Also, for UTF-8 specifically, cutting code points in half is fine as long as all strings are valid UTF-8. The UTF-8 encoding is prefix-free, i.e., the encoding of one code point is never a prefix of the encoding of another code point, so for prefix matching we can usually just compare bytes. It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all the edge cases of the Unicode spec anyway.
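To make the byte-comparison argument concrete, here is a minimal Python sketch. It is not the actual implementation: the field order, the little-endian packing, and the names `HEADER`, `make_header`, and `may_start_with` are assumptions made up for illustration. It only shows that a 4-byte inline prefix can cut a code point in half and still work as a cheap prefix filter.

```python
import struct

# Hypothetical layout: 4-byte length, 4-byte prefix, 8-byte pointer.
# "<" disables padding, so the header is exactly 4 + 4 + 8 = 16 bytes.
HEADER = struct.Struct("<I4sQ")


def make_header(s: str, ptr: int = 0) -> bytes:
    """Pack length, the first 4 UTF-8 bytes, and a placeholder pointer."""
    data = s.encode("utf-8")
    # Slicing bytes may split a multi-byte code point; "4s" zero-pads
    # prefixes shorter than 4 bytes.
    return HEADER.pack(len(data), data[:4], ptr)


def may_start_with(header: bytes, needle: str) -> bool:
    """Cheap filter using only the inline prefix.

    False means the string definitely does not start with `needle`.
    True means the full string still has to be checked when the needle
    is longer than 4 bytes.
    """
    n = needle.encode("utf-8")
    length, prefix, _ptr = HEADER.unpack(header)
    if len(n) > length:
        return False
    k = min(len(n), 4)
    return prefix[:k] == n[:k]


h = make_header("日本語")  # 9 bytes of UTF-8, 3 bytes per code point
print(HEADER.unpack(h)[1])              # prefix ends in the middle of "本"
print(may_start_with(h, "日"))           # True
print(may_start_with(h, "月"))           # False
```

Because the needle is itself valid UTF-8, it never ends in the middle of a code point, so a byte-wise match on the stored prefix cannot produce a false "no". It can produce a false "maybe" for needles longer than 4 bytes, which is why the full comparison is still required in that case.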
|
I'm sure you did, but there is something funny about reading this phrase while considering that you have robbed two bits from your pointer to represent class, which is admittedly the only thing I find questionable in your design.
If that's the case, it's a happy accident, because having a full code point here is quite nice.