by yurymik
2723 days ago
But neither AsciiString. It has an as_str method, but it's still a kludge. This example was basically a suggestion to throw0u1t: if they want to cut in the middle of a UTF-8 sequence for whatever reason, they can [edit:] do it without extra crates. What I don't understand is why slices are indexed in bytes and not in characters. If String can check that we're cutting in the middle of a character sequence, why doesn't it provide a way to take 3 fully formed characters?
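For what it's worth, here is a rough sketch of both operations using only the standard library. The helper name `first_n_chars` is made up for illustration, and "character" here means a Rust `char` (a Unicode scalar value), not a grapheme cluster.

```rust
/// Returns the prefix of `s` holding at most `n` whole `char`s.
/// (`first_n_chars` is a hypothetical helper, not a std method.)
fn first_n_chars(s: &str, n: usize) -> &str {
    // char_indices yields the byte offset at which each char starts;
    // the offset of the char at position n is where the prefix ends.
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // fewer than n chars: return the whole string
    }
}

fn main() {
    let s = "héllo"; // 'é' takes two bytes in UTF-8

    // Cutting by bytes, even mid-sequence, works on the byte slice:
    let raw: &[u8] = &s.as_bytes()[..2]; // 'h' plus the first byte of 'é'
    assert!(std::str::from_utf8(raw).is_err()); // no longer valid UTF-8

    // Cutting by characters has to walk the string from the start:
    assert_eq!(first_n_chars(s, 3), "hél");
}
```

Note that `first_n_chars` is linear in the length of the prefix, while the byte slice is constant time.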
|
If you want to find the one millionth codepoint of a UTF-8-encoded string, you have to visit more or less (1) every byte of the string before it.
If, on the other hand, you want to find the codepoint that covers the millionth byte, you have to read at most four bytes. Read the millionth byte; there are three cases:
- it's a full codepoint. If so, you're done.
- it is the first byte of a multi-byte codepoint. If so, read forwards in the string for up to 3 continuation bytes.
- it is a continuation byte. If so, search backwards in the string for the first byte, then, if necessary, read forwards to find more continuation bytes.
So that is O(1).
(1) You can skip continuation bytes, but these are typically rare.
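The three cases above can be sketched roughly like this. `char_range_at` is a name invented for this example, and the code relies on `&str` always holding valid UTF-8, so the backwards search is guaranteed to stop at a lead byte.

```rust
use std::ops::Range;

/// Returns the byte range of the codepoint covering byte `i` of `s`,
/// or None if `i` is out of bounds. (Hypothetical helper, not in std.)
fn char_range_at(s: &str, i: usize) -> Option<Range<usize>> {
    let bytes = s.as_bytes();
    if i >= bytes.len() {
        return None;
    }
    // Case 3: step back over continuation bytes (0b10xxxxxx).
    // In valid UTF-8 this takes at most three steps.
    let mut start = i;
    while bytes[start] & 0b1100_0000 == 0b1000_0000 {
        start -= 1;
    }
    // Cases 1 and 2: the lead byte encodes the sequence length.
    let len = match bytes[start] {
        b if b < 0x80 => 1,           // single byte: a full codepoint
        b if b >> 5 == 0b110 => 2,    // 110xxxxx
        b if b >> 4 == 0b1110 => 3,   // 1110xxxx
        _ => 4,                       // 11110xxx
    };
    Some(start..start + len)
}

fn main() {
    let s = "a€b"; // '€' occupies bytes 1..4
    let r = char_range_at(s, 2).unwrap(); // byte 2 is a continuation byte
    assert_eq!(&s[r], "€");
}
```

The loop and the match each touch a bounded number of bytes, regardless of how long the string is or where `i` falls.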