Hacker News new | ask | show | jobs
by jstimpfle 3206 days ago
> I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.

I definitely need indexes, and I don't really care about graphemes. I actually have only a vague idea what that is.

I write parsers typically by using a global string and lots of indices. The important thing for me is to be able to extract characters and slices at given positions, and to be able to say "parse error at line X character Y" where X and Y are helpful to the user most of the time.

I would be absolutely fine with working in UTF-8 bytes only (and that would be faster I guess), but there would be a more pressing need to recompute character positions (as a code point or grapheme index) from byte offsets at times.

There are more abstract parsing methods where parser subroutines are implemented in a position agnostic way, but I'm very happy with my simple method.

If everything works on graphemes instead of code points (as I think does Perl6) I will be happy to use that, but it's not so important from a practical standpoint.

3 comments

The issue with your model is that’s graphemes ultimately don’t behave like you may expect a character to. For example, it may take multiple code points to make a grapheme, so getting the index of any particular one might require walking the string instead of a constant time access since any one code point could be “globbed” with its neighbors into a grapheme–in a way that is dependent on its neighbors.
> I definitely need indexes

No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.

With indexes it works fast only if it's by codepoint. in a language that properly supports graphemes this would mean it would have to scan from the beginning to get to that index.

With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.

It's also super easy to get a slice using a start and end iterator. As for line x character y messages, you can't get that directly from an index as it depends on how many new lines you parsed so indexing doesn't help there.

Well, I could roll my own iterator which encapsulates a string and some position information, but then I'd have to wrap a lot of different operations, like advance, advance by n, compare two iterators by position, test for end position, extract character, extract slice, etc.

And the code would get a lot noiser, while the only advantage I see is graphemes support, which I have never needed so far. (And I hope graphemes are actually designed with a similar sensibility for technical concerns as is UTF-8, where I can simply parse with indexes at the byte level, looking only for ASCII characters, without headaches and with maximum performance.)

As for getting line/character from a byte or codepoint offset, that's no problem if I do the calculation only in case of an error. The alternative would be to do it on each advance, which again means ADT wrapping, thus line noise and slower performance.

I'm not advocating that the programmer needs to implement the iterators but that the language/runtime have built in support for them.

As for searching for ASCII, which is prevalent in parsing, the iterator function to find the next specified character can do a low level and fast byte search. That's one of the benefits of UTF-8, searching for ASCII characters is super fast.

You wouldn't have to do the character position on each advance. Just have a beginning of line iterator that's updated every time you see a newline character and on error you do call a function that gives you how many characters between the current position iterator and the start of line iterator.

Working with iterators is no more coplex than working with indexes. But it's the language that needs to provide them.

Formal parsers do without indexing, but those rolled by hand often do, for simplicity's sake. I think these cases can still be serviced by permitting indexes, but backing them by a lazily computed index table.