| HN Mirror


by tialaramex 5 days ago

Sure. To simplify, let's assume a 64-bit CPU (this all works for 32-bit, but that's less common these days and the actual numbers are different).

C++ std::string can contain up to 15 bytes (other popular implementations) or 22 bytes (libc++ from Clang) of inline text, and the data structure itself is either 24 bytes (Clang again) or 32 bytes of storage. Here's Raymond Chen: https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10...

CompactString is 24 bytes of storage with all 24 bytes as potential inline text. When the 24 bytes are valid UTF-8 text, that's the content of the CompactString, e.g. "https://example.org/cool". If they aren't, the last byte will be invalid UTF-8, and its value signals whether some of the other 23 bytes are inline UTF-8 (and if so, how many) or whether they should be interpreted as a pointer, size and capacity.
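To make the last-byte trick concrete, here is a minimal sketch. It relies on one fact: the final byte of valid UTF-8 text is always below 0xC0 (either ASCII or a continuation byte), so larger values are free to carry a tag. The constants `INLINE_LEN_BASE` and `HEAP_MARKER` and the names `Repr`, `classify_compact` and `inline_text` are made up for illustration; the real crate's encoding may well differ.

```rust
/// What a 24-byte buffer means under this sketch's assumed encoding.
#[derive(Debug, PartialEq)]
enum Repr {
    /// All 24 bytes are text.
    InlineFull,
    /// The first `n` bytes are text, with `n` in 0..=23.
    Inline(usize),
    /// The bytes hold a pointer, size and capacity instead of text.
    Heap,
}

// Hypothetical tag values, chosen only for this example.
const INLINE_LEN_BASE: u8 = 0xC0; // 0xC0 + n means "n inline bytes"
const HEAP_MARKER: u8 = 0xD8; // first value after 0xC0 + 23

/// Decide how to read the buffer by looking only at its last byte.
fn classify_compact(buf: &[u8; 24]) -> Repr {
    match buf[23] {
        // Valid UTF-8 text can only end in a byte below 0xC0.
        last if last < INLINE_LEN_BASE => Repr::InlineFull,
        last if last < HEAP_MARKER => Repr::Inline((last - INLINE_LEN_BASE) as usize),
        // Everything else is treated as the heap case in this sketch.
        _ => Repr::Heap,
    }
}

/// Return the inline text, if the buffer holds any.
fn inline_text(buf: &[u8; 24]) -> Option<&str> {
    let len = match classify_compact(buf) {
        Repr::InlineFull => 24,
        Repr::Inline(n) => n,
        Repr::Heap => return None,
    };
    std::str::from_utf8(&buf[..len]).ok()
}
```

With this layout, the 24 bytes of "https://example.org/cool" classify as `InlineFull`, since the final 'l' is plain ASCII.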

ColdString is a radically different idea. It's 8 bytes of unaligned storage, and it's one of three things:

1. 8 bytes of UTF-8 text; as before, we can tell by whether it's valid UTF-8 text.

2. 0-7 bytes of UTF-8 text, prefixed by an invalid UTF-8 byte telling us how many of the remaining bytes are text.

3. An encoded pointer to a length-prefixed data structure, signalled by the presence of the UTF-8 continuation marker bits, which should never be present in the first byte of a string.
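The three cases can be sketched as a small classifier over the 8 raw bytes. This is only an illustration of the idea: the tag range `SHORT_TAG_BASE` (0xF8..=0xFF, which never occurs in valid UTF-8) and the names `Kind` and `classify_cold` are assumptions, and the sketch does not decode the pointer at all. The actual crate's bit layout is likely different.

```rust
/// How the 8 bytes are interpreted in this sketch.
#[derive(Debug, PartialEq)]
enum Kind<'a> {
    /// Case 1: all 8 bytes are UTF-8 text.
    Full(&'a str),
    /// Case 2: a tag byte followed by 0-7 bytes of text.
    Short(&'a str),
    /// Case 3: an encoded pointer (left undecoded here).
    Heap,
}

// Hypothetical tag range: 0xF8 + n means "n bytes of text follow".
const SHORT_TAG_BASE: u8 = 0xF8;

/// Classify 8 raw bytes; returns None if they fit none of the three cases.
fn classify_cold(raw: &[u8; 8]) -> Option<Kind<'_>> {
    let first = raw[0];
    // Continuation bytes look like 10xxxxxx and never start valid text.
    if first & 0xC0 == 0x80 {
        return Some(Kind::Heap);
    }
    // 0xF8..=0xFF gives exactly eight tags, one per length 0..=7.
    if first >= SHORT_TAG_BASE {
        let len = (first - SHORT_TAG_BASE) as usize;
        return std::str::from_utf8(&raw[1..1 + len]).ok().map(Kind::Short);
    }
    // Otherwise the whole buffer must be valid UTF-8 text.
    std::str::from_utf8(raw).ok().map(Kind::Full)
}
```

The order of the checks matters: the first byte alone settles cases 2 and 3, so full validation is only needed for the 8-bytes-of-text case.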

I really like ColdString because it's so much in the "use the whole buffalo" spirit of these modern safe yet high-performance types. UTF-8 has what are called "overlong prefixes" because it was invented before Unicode decided it would never grow beyond U+10_FFFF. These are often just a useless impediment, but ColdString uses those prefixes.
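For a feel of how much spare room there is, this rough sketch asks Rust's own validator which byte values can never begin valid UTF-8 text. The helper names `can_start_utf8` and `free_first_bytes` are invented for this example, and the probing approach (trying a few representative tails) is just one convenient way to check, not anything taken from the crates discussed.

```rust
/// Returns true if some valid UTF-8 string can begin with byte `b`.
fn can_start_utf8(b: u8) -> bool {
    // Second-byte candidates are chosen so that every legal lead byte has
    // at least one acceptable follower (for example 0xE0 needs 0xA0..=0xBF
    // and 0xF0 needs 0x90..=0xBF).
    for second in [0x80u8, 0x90, 0xA0] {
        let seq = [b, second, 0x80, 0x80];
        // Try the 1-, 2-, 3- and 4-byte prefixes of the probe sequence.
        for len in 1..=4 {
            if std::str::from_utf8(&seq[..len]).is_ok() {
                return true;
            }
        }
    }
    false
}

/// Collects every first-byte value that valid UTF-8 text never starts with.
fn free_first_bytes() -> Vec<u8> {
    (0..=255u8).filter(|&b| !can_start_utf8(b)).collect()
}
```

The result covers the 64 continuation bytes (0x80..=0xBF) plus 0xC0, 0xC1 and 0xF5..=0xFF, so a type that inspects only the first byte has a fair number of values to use as tags.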


drysine 4 days ago

Thanks!

How do CompactString/ColdString compare to std::string implementations performance-wise? From the looks of it, they must be somewhat slower than C++ strings.


tialaramex 4 days ago

I do not have hard numbers. However, keep in mind that practical "performance" also includes memory bandwidth and total RAM. This is especially a consideration for the ColdString type: a billion ColdStrings is 8GB of RAM, but a billion MSVC std::string objects need 32GB of RAM. Rust's std::string::String is of course much faster than any of the std::string implementations because it never has the SSO case to consider, but for non-empty strings it also needs more memory bandwidth and RAM.
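The arithmetic behind those figures, as a small sketch. It counts only the string handles themselves, not any heap allocations they point to, and uses GB to mean 10^9 bytes. The helper names `handles_gb` and `rust_string_handle_bytes` are made up for this example.

```rust
use std::mem::size_of;

/// RAM taken by `count` string handles of `bytes_each` bytes, in GB (10^9 bytes).
/// Heap allocations behind the handles are not counted.
fn handles_gb(count: u64, bytes_each: u64) -> u64 {
    count * bytes_each / 1_000_000_000
}

/// Size of Rust's `String` handle on the current target.
/// In practice this is a pointer, a length and a capacity.
fn rust_string_handle_bytes() -> usize {
    size_of::<String>()
}
```

So a billion 8-byte handles come to 8GB and a billion 32-byte handles to 32GB. On a 64-bit target Rust's String handle is 24 bytes, and every non-empty String carries a heap allocation on top of that.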
