Hacker News
by TylerGlaiel 1206 days ago
> Tokenizing by storing strings in a vector is almost never what you want for high performance code, as it will result in an allocation for each token.

Not quite: std::string can store strings of up to 22 characters without allocating (in 64-bit mode with libc++, at least; libstdc++ and MSVC hold up to 15). Look up "short string optimization". 22 characters is actually quite a lot in the context of tokenization, so it's not a given that switching to string views would be an improvement here.
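
A rough way to check what a given standard library does, as a sketch rather than a guaranteed probe: the capacity of an empty string usually reveals the inline buffer size, and comparing `data()` against the object's own address range shows whether a particular string is stored inline. The helper names `inline_capacity`, `stored_inline`, and `sso_demo` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Capacity of an empty string. With SSO this is the inline buffer size
// (commonly 15 or 22 on 64-bit targets); the standard does not fix it.
inline std::size_t inline_capacity() {
    return std::string().capacity();
}

// Hypothetical helper: true if s's characters live inside the string object itself.
inline bool stored_inline(const std::string& s) {
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    std::less<const char*> lt;  // std::less gives a total order over pointers
    return !lt(s.data(), obj_begin) && lt(s.data(), obj_end);
}

// Usage: a 100-character token cannot fit inside the string object,
// so it has to live in a separate allocation.
inline void sso_demo() {
    const std::string long_token(100, 'x');
    assert(!stored_inline(long_token));
    assert(inline_capacity() < sizeof(std::string));
}
```

Whether a short token such as `if` or `return` reports as inline depends on the implementation, which is the point being argued in this thread.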


I would expect that std::string_view would still be significantly faster. Copying or moving an std::string with small string optimization is likely going to boil down to a branch (to check if the instance is using the small string optimization) and a memcpy. As opposed to copying or moving an std::string_view, which should be two MOV instructions.
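
For comparison, a minimal sketch of a tokenizer that hands out `std::string_view` tokens. It assumes tokens are separated by spaces only and that the input buffer outlives the returned views; `tokenize` and `tokenize_demo` are illustrative names, not from any library. A `std::string_view` is typically just a pointer and a length, so copying one involves no SSO check.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Splits on spaces only. The returned views point into `input`,
// so the underlying buffer must outlive them.
inline std::vector<std::string_view> tokenize(std::string_view input) {
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t start = input.find_first_not_of(' ', pos);
        if (start == std::string_view::npos) break;  // only spaces remain
        std::size_t end = input.find(' ', start);
        if (end == std::string_view::npos) end = input.size();
        tokens.push_back(input.substr(start, end - start));  // no per-token allocation
        pos = end;
    }
    return tokens;
}

// Usage: repeated spaces are skipped and produce no empty tokens.
inline void tokenize_demo() {
    const auto tokens = tokenize("int  x = 42;");
    assert(tokens.size() == 4);
    assert(tokens[0] == "int");
    assert(tokens[3] == "42;");
}
```
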
string_view is theoretically going to be faster, but it's the type of thing you'd really need to profile in actual context to see to what extent that holds. I was mainly just pointing out that (small) strings are actually way faster than people would think, since they don't need to allocate memory.

If I were actually tasked with hyper-optimizing a tokenizer, I would probably skip past string_view and use a pair of U16 indexes instead, assuming the input file is under 65,536 characters (with a "slow path" that uses U32 otherwise). I just think the optimized version is probably not going to be a whole order of magnitude faster than just using std::string (unless there are long tokens).
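
A sketch of what the index-pair idea could look like, under my own assumptions: tokens are space-separated, offsets are half-open `[begin, end)`, and the caller picks the index width from the input size. `Span`, `tokenize_spans`, `fits_u16`, `text_of`, and `count_tokens` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>

// Half-open [begin, end) offsets into the input buffer.
template <typename Index>
struct Span {
    Index begin;
    Index end;
};

// The fast path is only valid if every offset, including input.size(), fits in 16 bits.
inline bool fits_u16(std::string_view input) {
    return input.size() <= std::numeric_limits<std::uint16_t>::max();
}

// Splits on spaces only and records offsets instead of pointers.
// The caller must make sure input.size() fits in Index.
template <typename Index>
std::vector<Span<Index>> tokenize_spans(std::string_view input) {
    std::vector<Span<Index>> spans;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t start = input.find_first_not_of(' ', pos);
        if (start == std::string_view::npos) break;
        std::size_t end = input.find(' ', start);
        if (end == std::string_view::npos) end = input.size();
        spans.push_back({static_cast<Index>(start), static_cast<Index>(end)});
        pos = end;
    }
    return spans;
}

// Recovers the token text from a span and the original buffer.
template <typename Index>
std::string_view text_of(std::string_view input, Span<Index> s) {
    return input.substr(s.begin, static_cast<std::size_t>(s.end - s.begin));
}

// Usage: dispatch to U16 when the input is small enough, otherwise fall back to U32.
inline std::size_t count_tokens(std::string_view input) {
    if (fits_u16(input)) return tokenize_spans<std::uint16_t>(input).size();
    return tokenize_spans<std::uint32_t>(input).size();
}
```

The trade-off is that a span is meaningless without the original buffer, so every consumer has to carry the buffer around as well. Whether the smaller token size pays off is still something to profile.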

I don’t see why moving std::string needs to branch. You just copy the source and then zero it out, unconditionally.
It depends on your STL implementation's representation of string: https://godbolt.org/z/nMYGYoWbq

* libstdc++ keeps a pointer to its own internal buffer when the SSO is in use. If the moved-from string was pointing at its SSO buffer, the moved-to string needs to point at its own buffer instead. The branch distinguishes the SSO state from the heap-allocated state.

* libc++'s string move can be implemented this way, but the branch then happens on access to the string. A move assignment also still needs to free the destination's old heap-allocated buffer, if there is one.
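
The observable side of this can be sketched without looking at assembly: move a string and check whether the destination ends up pointing at the source's old buffer. This is an illustrative probe, not a statement about any implementation's internals; `MoveResult`, `move_and_inspect`, and `move_demo` are made-up names.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct MoveResult {
    bool content_preserved;  // destination holds the original characters
    bool buffer_reused;      // destination points at the source's old buffer
};

// Takes src by value so the function owns a distinct string object to move from.
inline MoveResult move_and_inspect(std::string src) {
    const std::string expected = src;
    const char* old_data = src.data();
    std::string dst = std::move(src);
    return {dst == expected, dst.data() == old_data};
}

// Usage: the contents survive the move whether or not the buffer is reused.
inline void move_demo() {
    assert(move_and_inspect("short").content_preserved);
    assert(move_and_inspect(std::string(200, 'x')).content_preserved);
}
```

With SSO, a short string's characters live inside the source object, so the destination has to hold its own copy and `buffer_reused` comes back false. For a long string, the common implementations hand over the heap pointer, so it usually comes back true, though the standard does not promise that.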

> std::string can store <=22 character strings without needing to allocate

Implementation-dependent: the C++ standard says nothing on this.

Sure, but this is how pretty much every implementation does it these days, so you should be able to safely rely on it (or just drop in your own string type that provides that guarantee if you are really concerned about it).