| HN Mirror



	by Avernar 3209 days ago
	What do you mean by "expose UTF-8"? Nothing about UTF-8 requires that you give byte access to the string. As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially fixed-width columns and such. You should be thinking relatively. For example: find the first space, then, from that point in the string, check that the next character is a letter. When you build your code that way you don't fall into the trap of byte indexing or take the performance hit of code point indexing (UTF-8) or grapheme indexing (all encodings).
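
As a minimal sketch of that relative style in Python (the function name is hypothetical, and Python's string iterator yields code points rather than graphemes, so this only approximates the idea):

```python
def letter_follows_first_space(text):
    """Return True if the character right after the first space is a letter."""
    chars = iter(text)  # yields one code point at a time; no index is kept
    for ch in chars:
        if ch == " ":
            # The iterator is now positioned just past the first space.
            following = next(chars, None)
            return following is not None and following.isalpha()
    return False  # the text contains no space at all
```

The caller never sees a numeric position, so the same code would work whatever the underlying storage is.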

2 comments

ubernostrum 3209 days ago

There are real-world textual data types for which your idealized approach simply does not work. As in, it would be impossible or impossibly unwieldy to validate conformance to the type using your approach, because they require indexing to specific locations, or determining length, or both.

For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.
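
For illustration only, the one rule quoted above ('A' followed by six digits) might be checked with indexing and length like this. The function name is hypothetical, and real HICN formats have more cases than this sketch covers:

```python
ASCII_DIGITS = "0123456789"

def is_a_prefix_six_digits(hicn):
    """Check a single rule: the prefix 'A' followed by exactly six digits."""
    if len(hicn) != 7:   # length operation
        return False
    if hicn[0] != "A":   # index to a fixed position
        return False
    # Compare against ASCII digits only; str.isdigit() would also accept
    # digit characters from other scripts.
    return all(ch in ASCII_DIGITS for ch in hicn[1:])
```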

These data types can and should first be subjected to some basic checks to ensure they're not nonsense (e.g., something expected to be a numeric value probably should not contain Linear B code points, and it's probably a good idea to at least throw a regex at it first, but then applying regex to Unicode also has quirks people don't often expect at first...).

Avernar 3209 days ago

I don't see why this would be hard with iterators. You have an iterator to the start of the HICN, either at the start of a string or deep inside one. Take a second iterator and set it equal to the first. Loop six times, advancing that iterator and checking that each character is a digit. Then check whether the next position is a space.

For the prefix and suffix and how many characters between them you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for how many characters between the two.
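
A rough Python sketch of the six-digits-then-space check described above, assuming the caller has already advanced the iterator to the start of the field (the function name is made up for this example):

```python
def six_digits_then_space(chars):
    """Consume six ASCII digits and then a space from an iterator of characters."""
    for _ in range(6):
        ch = next(chars, None)
        if ch is None or ch not in "0123456789":
            return False
    # After the six digits, the next character must be a space.
    return next(chars, None) == " "
```

Because the function takes an iterator rather than a string and an offset, it does not matter whether the field sits at the start of the string or deep inside it.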

It's very easy to think about it this way, as that's how a normal (non-programmer) human would do it. Basically the code literally does what you wrote in English above.

My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily, or more easily, with iterators as with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial settings, but fortunately those systems are ASCII-based, so it's not an issue.

ubernostrum 3209 days ago

You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing.

If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.

Avernar 3208 days ago

If the string is stored as ASCII characters or fixed-width Unicode code units (UCS-2 or UCS-4) then you are correct that not much changes. But if the string is in UTF-8 or UTF-16, or the string system uses graphemes, then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With an iterator it is a quick operation to access what it's pointing to, and very quick to advance it.
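
To make the O(N) cost concrete, here is a small sketch of what "index to code point n" has to do when the storage is UTF-8 bytes. The function name is hypothetical and the input is assumed to be valid UTF-8:

```python
def byte_offset_of_code_point(data, n):
    """Return the byte offset where code point number n (0-based) starts.

    Assumes data is valid UTF-8. Every call scans from the first byte.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")
```

An iterator avoids the repeated scan because it remembers the byte offset it reached last time and only decodes forward from there.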

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8, though grapheme support is the better option). And they don't hurt when used on ASCII or fixed-width strings either, so the code will work with either string format. No hairs, split or otherwise, here.

int_19h 3208 days ago

I agree that iterators generally make more sense with strings. But sometimes, you really do want to operate on code points - for example, because you're writing a lexer, and the spec that you're implementing defines lexemes as sequences of code points.

Avernar 3207 days ago

That's why the search functions need to be more intelligent. If you pass the search function a grapheme it will do more work. If it notices that the grapheme you passed in is just a single code point, it can do a code point scan. And if the internal representation is UTF-8 and it sees you passed in an ASCII character (very common in lexing/parsing), it will just do a fast byte scan for it.
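
The ASCII fast path could look something like the following sketch, assuming the string is held as valid UTF-8 bytes (the function name is hypothetical):

```python
def find_ascii(data, ch):
    """Find an ASCII character in UTF-8 bytes; return a byte offset or -1."""
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("expected a single ASCII character")
    # In UTF-8, every byte of a multi-byte sequence has its high bit set,
    # so a byte below 0x80 can only be a complete ASCII character.
    return data.find(ch.encode("ascii"))
```

No decoding is needed on this path, which is why a plain byte search is safe here.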

Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.

It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Normalizing them when matching identifiers would probably also be a good idea.
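
As a small illustration of the normalization point, using Python's standard `unicodedata` module (the helper name and the choice of NFC are assumptions made for this sketch):

```python
import unicodedata

def same_identifier(a, b):
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
```

Here `composed == decomposed` is false, since the code point sequences differ, while `same_identifier(composed, decomposed)` is true.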

mjevans 3209 days ago

int_19h's approach is still valid for this; you're asking for whole displayed characters, each made up of some number of bits (you don't need to know how many) spread across several units of the memory segment(s) that hold the string.

Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-broken form. If they really only allow numbers in runs of text you might save space and speed comparison/indexing by doing this.

ubernostrum 3209 days ago

It's more that I get tired of people declaring that indexing and length operations need to be completely and utterly and permanently forbidden and removed, and then proposing that they be replaced by operations which are equivalent to indexing and length operations.

Doing these operations on sequences of code points can be perfectly safe and correct, and in 99.99%+ of real-world cases probably will be perfectly safe and correct. My preference is for people to know what the rare failure cases are, and to teach how to watch out for and handle those cases, while the other approach is to forbid the 99.99% case to shut down the risk of mishandling the 0.01% case.

mjevans 3209 days ago

When people say they should be removed they mean primitive operations (like a standard 'length' attribute/function, or an array index operator) shouldn't exist for that type.

Just as it is better to have something like .nth(X) as a function for stepping to a numbered node, so too does a language's string type demand operations like .nth_printing(X), .nth_rune(X) and .nth_octet(X), to make the intent clear to any programmer working with that code.
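
A sketch of what those three explicitly named operations might look like, written as free functions in Python. The names come from the comment above; the bodies are assumptions, and `nth_printing` only groups a base character with the combining marks that follow it, which is much simpler than full grapheme segmentation (UAX #29):

```python
import unicodedata

def nth_octet(text, n):
    """n-th byte of the UTF-8 encoding, as an int."""
    return text.encode("utf-8")[n]

def nth_rune(text, n):
    """n-th code point (what Python's own str indexing returns)."""
    return text[n]

def nth_printing(text, n):
    """n-th cluster: a base character plus any following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters[n]
```

For the text `"e\u0301x"`, position 1 is a different thing under each function: a byte of the accent's encoding, the combining accent itself, or the letter `x`.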

Avernar 3208 days ago

Semantically equivalent, yes; equivalent in access time for variable-width strings, no. One of the reasons for Python 3's odd internal string format is that they wanted to keep indexing and have indexing be O(1). The reason I favor replacing indexing with iterators is that it removes this restriction: they could have made the internal format UTF-8 and/or easily added support for graphemes.
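
The format in question can be observed indirectly. The sketch below assumes CPython 3.3 or later, where the flexible string representation (PEP 393) stores each string with a fixed width per character, chosen from the widest character it contains. The helper name is made up, and other Python implementations may behave differently:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes.

    Subtracting the two sizes cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

ascii_width = bytes_per_char("a")            # fits in one byte
bmp_width = bytes_per_char("\u0101")         # needs two bytes
astral_width = bytes_per_char("\U0001F600")  # needs four bytes
```

On such a CPython the three widths should come out as 1, 2 and 4, which is the price paid for keeping O(1) indexing by code point.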

I prefer to have a system where 100% of the cases are valid and teaching people corner cases is not required. We all know how well teaching people about surrogate pairs went. And we're not forbidding the 99.99% case but providing an alternative way to accomplish the exact same thing. The vast majority of code uses index variables as a form of iterator anyways so it's not that big of a change.

The main reason people keep clinging to indexing strings is that's all they know. Most high level languages don't provide another way of doing it. People who program in C quickly switch from indexing to pointers into strings. Give a C programmer an iterator into strings and they'll easily handle it.

int_19h 3208 days ago

By "expose UTF-8" I mean exposing the underlying UTF-8 representation of the string directly on the object itself, instead of going through a separate byte array (or byte array view, to avoid copying).

Avernar 3207 days ago

Ah, I see. I agree that it would be a bad idea to give access to the UTF-8 representation.

As for length in bytes, a good way to handle most use cases regarding that is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed buffer and the truncation would happen on a grapheme level.
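
A minimal sketch of such a truncation function in Python. The names are hypothetical, and a "cluster" here is just a base character plus its following combining marks, a simplification of real grapheme segmentation:

```python
import unicodedata

def clusters(text):
    """Yield each base character together with its trailing combining marks."""
    group = ""
    for ch in text:
        if group and not unicodedata.combining(ch):
            yield group
            group = ""
        group += ch
    if group:
        yield group

def truncate_to_bytes(text, max_bytes):
    """Longest prefix whose UTF-8 encoding fits in max_bytes, cut only
    between clusters so no character or accent is split."""
    kept = []
    used = 0
    for cluster in clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)
```

The caller only states the buffer size; where the cut lands is decided by the string code, which knows where the boundaries are.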
