Hacker News
by Avernar 3212 days ago
I don't see why this would be hard with iterators. You have an iterator to the start of the HICN, either at the start of a string or deep in the string. Take a second iterator and set it to the first. Loop six times, advancing that iterator and checking that each character is a digit. Then check whether the next position is a space.
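
To make that concrete, here's a rough sketch in Rust of the steps as described. The function name is made up for illustration, and treating "digit" as an ASCII digit is my assumption; the only thing taken from the description is six digits followed by a space.

```rust
use std::str::Chars;

/// Hypothetical helper: true if the text at `start` is six ASCII digits
/// followed by a space. `start` is only cloned, never advanced.
fn six_digits_then_space(start: &Chars<'_>) -> bool {
    // Take a second iterator and set it to the first.
    let mut probe = start.clone();
    // Loop six times, advancing the copy and checking for a digit.
    for _ in 0..6 {
        match probe.next() {
            Some(c) if c.is_ascii_digit() => {}
            _ => return false,
        }
    }
    // Then check whether the next position is a space.
    probe.next() == Some(' ')
}
```

Cloning a `Chars` iterator just copies its position, so the caller's iterator stays where it was and can be reused for the next check.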

For the prefix and suffix and how many characters are between them, you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for the number of characters between the two iterators.
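
Both options can be sketched in Rust roughly like this. The names `find_suffix` and `chars_between` are hypothetical, and the suffix test is passed in as a closure because the actual suffix rule isn't spelled out here. The second function assumes both iterators come from the same string and that `second` is at or after `first`.

```rust
use std::str::Chars;

/// Option 1: advance a copy of `start` until `is_suffix` matches, counting
/// as we go. Returns the count and an iterator positioned at the suffix.
fn find_suffix<'a>(
    start: &Chars<'a>,
    is_suffix: impl Fn(char) -> bool,
) -> Option<(usize, Chars<'a>)> {
    let mut probe = start.clone();
    let mut advanced = 0;
    loop {
        let here = probe.clone(); // position before consuming the next char
        match probe.next() {
            Some(c) if is_suffix(c) => return Some((advanced, here)),
            Some(_) => advanced += 1,
            None => return None,
        }
    }
}

/// Option 2: ask how many characters lie between two iterators.
/// Assumes same underlying string, with `second` not before `first`.
fn chars_between(first: &Chars<'_>, second: &Chars<'_>) -> usize {
    let bytes = first.as_str().len() - second.as_str().len();
    first.as_str()[..bytes].chars().count()
}
```

Note that the count is in code points, not bytes: in `"héllo-world"` the `-` is five characters in, even though it sits at byte offset 6.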

It's very easy to think about it this way, as that's how a normal (non-programmer) human would do it. Basically, the code literally does what you wrote in English above.

My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily, or more easily, with iterators than with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial situations, but fortunately those systems are ASCII-based, so it's not an issue.

You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing.

If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.

If the string is stored as ASCII characters or fixed-width Unicode code points (UCS-2 or UCS-4) then you are correct that not much changes. But if the string is stored as UTF-8 or UTF-16, or the string system uses graphemes, then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With an iterator, it's a quick operation to access what it's pointing to and very quick to advance it.
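
Here's a small Rust sketch of the difference, using code points over UTF-8 (Rust's `str`) as a stand-in, since the standard library has no grapheme iterator. The function names are made up for illustration.

```rust
/// Index-style access on UTF-8: finding the n-th code point means
/// scanning from the start of the string, so each call is O(n).
fn char_at(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Index-style loop: every lookup rescans from the beginning,
/// so a full pass over the string is O(n^2).
fn count_digits_by_index(s: &str) -> usize {
    let len = s.chars().count();
    (0..len)
        .filter(|&i| char_at(s, i).map_or(false, |c| c.is_ascii_digit()))
        .count()
}

/// Iterator-style loop: each step decodes one code point and moves on,
/// so a full pass is O(n).
fn count_digits_by_iter(s: &str) -> usize {
    s.chars().filter(|c| c.is_ascii_digit()).count()
}
```

Both loops give the same answer; only the amount of rescanning differs.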

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8, though grapheme support is better). And they don't hurt when used on ASCII or fixed-width strings either, so the code will work with either string format. No hairs, split or otherwise, here.

I agree that iterators generally make more sense with strings. But sometimes, you really do want to operate on code points - for example, because you're writing a lexer, and the spec that you're implementing defines lexemes as sequences of code points.

That's why the search functions need to be more intelligent. If you pass the search function a grapheme, it will do more work. If it notices that the grapheme you passed in is just a single code point, it can do a code point scan. And if the internal representation is UTF-8 and it sees you passed in an ASCII character (very common in lexing/parsing), it will just do a fast byte scan for it.
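
A rough Rust sketch of that kind of dispatch, covering only the last two tiers (single code point and ASCII). `find_char` is a hypothetical helper, not a claim about how any real library is implemented; in practice Rust's `str::find` already accepts a `char`. The grapheme tier is left out because it needs segmentation data the standard library doesn't ship.

```rust
/// Hypothetical search helper: byte offset of the first `needle` in `haystack`.
fn find_char(haystack: &str, needle: char) -> Option<usize> {
    if needle.is_ascii() {
        // Fast path: in UTF-8, bytes below 0x80 only ever encode ASCII
        // characters, so a raw byte scan cannot match inside a
        // multi-byte sequence.
        let b = needle as u8;
        haystack.as_bytes().iter().position(|&x| x == b)
    } else {
        // General path: decode code points and compare each one.
        haystack
            .char_indices()
            .find(|&(_, c)| c == needle)
            .map(|(i, _)| i)
    }
}
```

Both paths return the same kind of offset, so the caller doesn't need to know which one ran.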

Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.

It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Normalizing them when matching identifiers would probably be a good idea too.
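
To show why, here's a minimal Rust sketch that truncates by code points, which is all the standard library offers. `truncate_code_points` is a made-up name, and the example identifier is mine.

```rust
/// Truncates `s` to at most `max` code points (not graphemes).
fn truncate_code_points(s: &str, max: usize) -> &str {
    match s.char_indices().nth(max) {
        // Cut at the byte offset where the (max+1)-th code point starts.
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than max+1 code points: nothing to cut.
        None => s,
    }
}
```

With the identifier `"cafe\u{301}x"` (an `e` followed by a combining acute accent), truncating to 4 code points yields `"cafe"` and silently drops the accent, which a grapheme-aware truncation would not do. And since `"caf\u{e9}"` and `"cafe\u{301}"` compare as different strings despite displaying the same, matching without normalization would treat them as two identifiers.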