| HN Mirror

by kmill 3431 days ago

There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."
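To make the distinction concrete, here is a minimal Python sketch of the four counts. The helper names are made up for illustration. The grapheme count is only a rough approximation (it skips combining marks and ignores ZWJ sequences, Hangul jamo, regional indicators, and so on); real segmentation follows the rules in Unicode UAX #29, which the Python standard library does not implement.

```python
import unicodedata

def utf8_bytes(s: str) -> int:
    return len(s.encode("utf-8"))

def utf16_code_units(s: str) -> int:
    # Encode without a BOM; every UTF-16 code unit is 2 bytes.
    return len(s.encode("utf-16-le")) // 2

def codepoints(s: str) -> int:
    # Python's len() on str counts code points.
    return len(s)

def approx_grapheme_clusters(s: str) -> int:
    # Rough approximation only: count code points that are not combining marks.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "e" + COMBINING ACUTE ACCENT, followed by U+1F600 (an emoji outside the BMP)
s = "e\u0301\U0001F600"

print(utf8_bytes(s))                # 7
print(utf16_code_units(s))          # 4
print(codepoints(s))                # 3
print(approx_grapheme_clusters(s))  # 2
```

Four different answers for one short string, and each is the right one for some purpose.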

For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!

The password field bug is possibly compelling, but I don't think it's obvious what a password field should do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?

(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)
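For what it's worth, the "forward ordering on reversed strings" idea might look something like the sketch below, which groups words by common suffix (rhyming-dictionary style). The word list is made up, and the reversal is a naive code point reversal, which is exactly the operation that misbehaves on combining sequences.

```python
def sort_by_suffix(words):
    # Reuse the default lexicographic ordering, applied to reversed strings.
    # Note: w[::-1] reverses code points, not grapheme clusters.
    return sorted(words, key=lambda w: w[::-1])

words = ["cat", "running", "bat", "jumping"]
print(sort_by_suffix(words))  # ['running', 'jumping', 'bat', 'cat']

# Naive reversal detaches a combining mark from its base character:
accented = "e\u0301"          # "e" + COMBINING ACUTE ACCENT
flipped = accented[::-1]      # the accent now comes first, with nothing to attach to
print(flipped == "\u0301e")   # True
```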

3 comments

toast0 3431 days ago

> Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

If you have a limit on the length of a field, it helps to tell the user what it is in a way they understand. For non-technical users, bytes (and the embedded issue of encoding) and code points are both pretty esoteric, but number of symbols is less so. OTOH, SMS has strict data and encoding limits, and people managed with that; also, provisioning byte storage for grapheme-limited fields is hard: some graphemes use a ton of code points; family emoji and zalgo text are clear examples.
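A quick Python sketch of why provisioning is hard. The family emoji below is the usual man, woman, girl, boy sequence joined with ZERO WIDTH JOINER, and the "zalgo" string is just one letter with 20 stacked accents (an arbitrary number picked for illustration). `storage_cost` is a hypothetical helper.

```python
def storage_cost(s: str) -> dict:
    # Sizes of one string under the encodings a storage layer might care about.
    return {
        "codepoints": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }

# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY: renders as one family emoji where supported
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# One base letter with 20 combining acute accents stacked on it
zalgo_a = "a" + "\u0301" * 20

print(storage_cost(family))   # {'codepoints': 7, 'utf8_bytes': 25, 'utf16_units': 11}
print(storage_cost(zalgo_a))  # {'codepoints': 21, 'utf8_bytes': 41, 'utf16_units': 21}
```

Each of those is what a user would call one symbol, and nothing stops the second one from growing further, so a limit of N graphemes does not translate into a fixed byte budget.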

paulddraper 3431 days ago

Why do you have a limit on the length of a field?

So it can fit in a database, i.e. with a certain number of bytes?

Sean1708 3430 days ago

If that's why you have a limit then please go and change that immediately.

No, this post is talking about having a minimum length on the password for safety reasons (i.e. a limit on the minimum entropy). You're right that a minimum byte length will ensure this, but what happens when your user types in n-1 "things" but their password gets accepted anyway? That's only a minor thing, but (and I'm not entirely sure whether this is possible) what about when your user types in n "things" but the password doesn't get accepted because it's actually only n-1 bytes? Now the password won't be accepted and the user has no idea why.

I agree that these are relatively trivial things, but the point is that it's not as simple as "just use the byte length".

toast0 3431 days ago

Some limits are technical (and in that case the hard limit is often bytes, but sometimes code units or code points, or broken if you told MySQL utf8 instead of bytes or utf8mb4), but in many cases, the limits are for aesthetic purposes: a post title or a username is often required to be fairly short to look nice; in an ASCII or Latin-1 world, those limits are usually expressed in terms of characters, but graphemes might be the right thing to limit in a Unicode world.
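On the MySQL aside: its `utf8` charset is the 3-byte `utf8mb3`, so it cannot store code points above U+FFFF, whereas `utf8mb4` can. A minimal Python sketch of a pre-flight check, with a made-up function name:

```python
def fits_mysql_utf8mb3(s: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane,
    i.e. needs at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in s)

print(fits_mysql_utf8mb3("h\u00e9llo"))       # True: U+00E9 is in the BMP
print(fits_mysql_utf8mb3("hi \U0001F600"))    # False: U+1F600 needs 4 bytes
```

This only checks representability. It says nothing about whether the value fits the column's length limit.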

kmill 3431 days ago

"Your username must be 1-4cm when printed with 12pt Times New Roman."

I kind of like the idea of minimum length in cm as a password requirement.

martin-adams 3431 days ago

What about "Your username must be no longer than 3 seconds when spoken out loud"

or, "Your username must not take more than 0.001ml of ink when printed at 12pt"

desdiv 3431 days ago

Without a limit on password length, an attacker can DoS you by forcing you to run your KDF on gigabyte-sized strings.
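A minimal sketch of that kind of guard in Python, assuming the KDF runs server-side. The 1024-byte cap, the function name, and the choice of PBKDF2 with these parameters are placeholders for illustration, not recommendations. The limit is in bytes because that is the unit the KDF actually consumes.

```python
import hashlib
import os

MAX_PASSWORD_BYTES = 1024  # hypothetical cap; pick one that suits your system

def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    raw = password.encode("utf-8")
    # Reject oversized input before doing any expensive work.
    if len(raw) > MAX_PASSWORD_BYTES:
        raise ValueError("password too long")
    return hashlib.pbkdf2_hmac("sha256", raw, salt, iterations)

salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
print(len(key))  # 32
```

In practice you would also cap the request body size earlier in the stack, so the gigabyte never reaches this function at all.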

paulddraper 3430 days ago

Giga byte sized strings?

Oh, no. That doesn't make sense. You need to limit by Giga grapheme strings.

geocar 3431 days ago

They're only denying service to themselves if you run the KDF locally.

jfoutz 3431 days ago

It's a lot like equality. Same pointer? Same value? p and q point to different nodes in a circular list. Does p equal q?
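Sketched in Python, with a throwaway `Node` class: the same two nodes are "equal" or not depending on which question you ask. `same_cycle` is a made-up helper and assumes a purely circular list (every node lies on the cycle).

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

# Two distinct nodes holding the same value, linked into one circular list.
p = Node(1)
q = Node(1)
p.next = q
q.next = p

def same_cycle(a, b):
    """True if b is reachable from a by following next pointers.
    Assumes a lies on a cycle or the list ends in None."""
    node = a
    while True:
        if node is b:
            return True
        node = node.next
        if node is a or node is None:
            return False

print(p is q)              # False: different objects
print(p.value == q.value)  # True: same stored value
print(same_cycle(p, q))    # True: same list
```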

Semantics matter a lot.

kmill 3431 days ago

To expand on this point, one resolution to the Ship of Theseus problem is that the point at which the ship stops being the "same" ship depends on how you are going to define "same." "Same" could mean different things depending on what you are trying to do, so this isn't just an it's-just-semantics cop-out. In particular, to borrow something Ravi Vakil once said, a definition is worthless unless it has a use (which in his case, as a mathematician, means it can be used to uncover and prove a theorem). This is what I have in mind: I do not think it is worthwhile to worry about "the true length of a Unicode string" unless there is something you could do if only you could compute it, and I've been trying to think of something but have come up short.

Speaking of equality: in a lecture about logic I once gave, I asked the students whether {1,2} and {1,2} were the same. In a very real sense, they are different because I drew them (or typed them) in different places and slightly differently -- I promise I typed the second {1,2} with different fingers. But, through the lens of same-means-same-elements, they are the same. That is a warmup for {1,2} vs {1,1,2}, and {1,2} vs {n : n is a natural number and 1 <= n <= 2}.
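Python happens to draw the same line, if you want to poke at it. The comprehension only searches a finite window of naturals (`range(100)` is an arbitrary stand-in for "all natural numbers"), which is enough here.

```python
a = {1, 2}
b = {1, 2}     # typed separately, so a distinct object
c = {1, 1, 2}  # the repeated element collapses
d = {n for n in range(100) if 1 <= n <= 2}  # set-builder style

print(a == b == c == d)  # True: same elements
print(a is b)            # False: two different objects in memory
```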

(There's also kind of a joke about how my set of natural numbers might be red and your set of natural numbers might be blue, but the theory of sets doesn't care about the difference.)

kccqzy 3431 days ago

That forgotten use might be a special string sorting algorithm such as LSD. Or it could be a trie but input strings have many common prefixes.
