| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ubernostrum 3209 days ago
	You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing. If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.

1 comments

Avernar 3208 days ago

If the string is stored as ASCII characters or Unicode code points (UCS-16 or UCS-32) then you are correct that not much changes. But if the string is in UTF-8, UTF-16 or the string system uses graphemes then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With iterators it would be a quick operation to access what it's pointing to and very quick to advance it.

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8 but grapheme support is superior). And they don't hurt when used on ASCII or fixed width strings either so the code will work with either string format. No hairs, split or otherwise here.

link

int_19h 3208 days ago

I agree that iterators generally make more sense with strings. But sometimes, you really do want to operate on code points - for example, because you're writing a lexer, and the spec that you're implementing defines lexemes as sequences of code points.

link

Avernar 3207 days ago

That's why the search functions need to be more intelegent. If you pass the search function a grapheme it will do more work. If it notices you just passed in a grapheme that's just a code point it can do a code point scan. And if the internal representstion is UTF-8 and it sees you passed in an ASCII charzcter (very common in lexing/parsing) it will just do a fast byte scan for it.

Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.

It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Also normalizing them when matching identifiers would also probably be a good idea.

link