HN Mirror


by Avernar 3212 days ago

> I definitely need indexes

No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.

With indexes it's fast only if the index counts codepoints. In a language that properly supports graphemes, it would have to scan from the beginning of the string to reach that index.

With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.
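A rough sketch of the difference, using raw UTF-8 bytes in Python as a stand-in for a variable-width string. Both `offset_of_codepoint` and `scan_token` are hypothetical helper names, and a saved byte offset plays the role of the iterator here. Graphemes would need more than this; the sketch only covers codepoints.

```python
def offset_of_codepoint(data, n):
    """Byte offset of the n-th codepoint (0-based). Has to walk from the start."""
    count = 0
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a codepoint.
        if (byte & 0xC0) != 0x80:
            if count == n:
                return pos
            count += 1
    raise IndexError("codepoint index out of range")


def scan_token(data, pos):
    """Resume at a saved byte position and scan to ASCII whitespace or the end."""
    end = pos
    while end < len(data) and data[end] not in b" \t\n":
        end += 1
    return end


data = "привет мир token".encode("utf-8")
start = offset_of_codepoint(data, 7)   # cost grows with distance from the start
end = scan_token(data, start)          # cost depends only on the token length
token = data[start:end].decode("utf-8")
```

Once `start` is known, `scan_token` never touches the bytes before it, which is the property being argued for.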

It's also super easy to get a slice using a start and an end iterator. As for "line x, character y" messages, you can't get that directly from an index, because it depends on how many newlines you have parsed, so indexing doesn't help there.

1 comment

jstimpfle 3212 days ago

Well, I could roll my own iterator which encapsulates a string and some position information, but then I'd have to wrap a lot of different operations, like advance, advance by n, compare two iterators by position, test for end position, extract character, extract slice, etc.

And the code would get a lot noisier, while the only advantage I see is grapheme support, which I have never needed so far. (And I hope graphemes are actually designed with a similar sensibility for technical concerns as UTF-8 is, where I can simply parse with indexes at the byte level, looking only for ASCII characters, without headaches and with maximum performance.)

As for getting line/character from a byte or codepoint offset, that's no problem if I do the calculation only in case of an error. The alternative would be to do it on each advance, which again means ADT wrapping, thus line noise and slower performance.
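A minimal sketch of that "only in case of an error" calculation, assuming the input is UTF-8 bytes and the offset sits on a codepoint boundary (true whenever the parser stopped on an ASCII character). `line_and_column` is a hypothetical helper name.

```python
def line_and_column(data, offset):
    """Return a 1-based (line, column) pair for a byte offset into UTF-8 data."""
    # Count the newlines before the offset to get the line number.
    line = data.count(b"\n", 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, so +1 gives offset 0.
    line_start = data.rfind(b"\n", 0, offset) + 1
    # Decode only the current line's prefix so the column counts codepoints, not bytes.
    column = len(data[line_start:offset].decode("utf-8")) + 1
    return line, column


source = "a = 1\nnäme = ?\n".encode("utf-8")
error_offset = source.index(b"?")
position = line_and_column(source, error_offset)
```

The scan over the input happens once, at error time, so the normal parsing path carries no bookkeeping.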

Avernar 3210 days ago

I'm not saying the programmer should have to implement the iterators, but that the language/runtime should have built-in support for them.

As for searching for ASCII, which is prevalent in parsing, the iterator function that finds the next specified character can do a fast, low-level byte search. That's one of the benefits of UTF-8: searching for ASCII characters is super fast.
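To illustrate why the byte search is safe, here is a small sketch using Python `bytes`. `find_ascii` is a hypothetical wrapper around the built-in `bytes.find`, not something a particular runtime provides.

```python
def find_ascii(data, ch, start=0):
    """Return the byte offset of the next occurrence of an ASCII character, or -1."""
    if len(ch) != 1 or ord(ch) >= 0x80:
        raise ValueError("find_ascii only accepts a single ASCII character")
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII byte
    # value can never appear inside another character's encoding.
    return data.find(ch.encode("ascii"), start)


text = "ключ=значение;next=1".encode("utf-8")
eq = find_ascii(text, "=")
semi = find_ascii(text, ";", eq)
key = text[:eq].decode("utf-8")
value = text[eq + 1:semi].decode("utf-8")
```

The search never decodes anything; only the two slices that are actually needed get turned back into text.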

You wouldn't have to compute the character position on each advance. Just keep a beginning-of-line iterator that's updated every time you see a newline character, and on error call a function that tells you how many characters lie between the start-of-line iterator and the current position iterator.
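Something like the following sketch, again with byte offsets into UTF-8 data standing in for iterators. `first_error_position` and the `is_bad` callback are made-up names for illustration, and `is_bad` is assumed to match only ASCII bytes so the error position lands on a codepoint boundary.

```python
def first_error_position(data, is_bad):
    """Scan UTF-8 bytes and return (line, column) of the first bad byte, or None."""
    line = 1
    line_start = 0  # byte offset where the current line begins
    for pos, byte in enumerate(data):
        if is_bad(byte):
            # Only now count the codepoints between the line start and the error.
            column = len(data[line_start:pos].decode("utf-8")) + 1
            return line, column
        if byte == 0x0A:  # newline: remember where the next line starts
            line += 1
            line_start = pos + 1
    return None


data = "x = 1\nжук = @\n".encode("utf-8")
position = first_error_position(data, lambda b: b == ord("@"))
```

The per-byte cost is one comparison against the newline byte; the character count is paid once, when an error is reported.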

Working with iterators is no more complex than working with indexes. But it's the language that needs to provide them.