|
|
|
|
|
by ben-schaaf
53 days ago
|
|
There's really an endless list of these optimizations. A few I've used (though not necessarily in rust): Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated. Bump allocator: your strings are &str, allocation is super fast with limited fragmentation. Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer. |
|
First, on the heap we have a self-indicating length prefix, basically we use the bottom 7 bits of each byte to indicate 7 bits of length and the top bit indicates there are more bits in the next byte. So "ben-schaaf" would be 0x0A then the ASCII for "ben-schaaf"
But, we avoid even having a heap allocation if we have 8 or fewer UTF-8 bytes to encode the text, that's our Small String Optimisation.
To pull this off we specify that our heap allocations will have 4 byte alignment even though they don't need it. This shouldn't be a problem, in fact many allocators never actually deliver smaller alignments anyway.
This means our pointer now has two spare bits, the least significant bits are now always zero for a valid heap pointer. We rotate these bits to the top of the first byte (this varies depending on whether the target is big-endian or little-endian) and we mask them so that for these valid pointers they are 0b10xxxxxx
So, now we can look at the "single pointer" and figure out
If it begins 0b10xxxxxx it really will be a valid pointer, rotate it, mask out that flag bit and dereference the pointer to find the length-prefixed text.
If it begins 0b11111AAA there's a short string here but it didn't need all 8 bytes, the next AAA bytes of the "pointer" are just UTF-8 and conveniently AAA is enough binary for 0 through 7 to be signalled, the exact length we have
If it has any other value the entire 8 bytes of "pointer" is a UTF-8 encoded string