by mbell 3258 days ago
> I consider the fact that a stupid-simple package was depended-upon by so many mature libraries as an indication it should be a language feature.

Unfortunately neither the npm module nor the browser version really does what most people want, and string handling in javascript is still a minefield.

'\u{1F4A9}'.padStart(5, '1') => "111" // oops (\u{1F4A9} is at the end of this, HN filter)

'\u{1F4A9}'.length => 2

[...'\u{1F4A9}'].length => 1 //WTF?

'mañana'.padStart(7, '1') => "1mañana" // ok

'man\u0303ana'.padStart(7, '1') => "mañana" // oops

'man\u0303ana'.length => 7

[...'man\u0303ana'].length => 7 // WTF? Why doesn't this match the behavior of [...'\u{1F4A9}'].length ?

'man\u0303ana'.normalize('NFC').padStart(7, '1') => "1mañana" // OK

I do understand the unicode issues here, but the inconsistency in the APIs from a user perspective and lack of any fully cross browser support for sane string processing in Javascript means we still have only a few options:

1) Don't do string processing in javascript at all.

2) Include a library to make it sane; these are usually huge because they need large lookup tables.

3) Accept that things won't always be correct.

This is one example but the lack of a sane standard library in Javascript is one of the biggest problems the web has right now. I'd be curious to know how many bytes of JS are loaded on the average website just to work around the lack of standard library support for basic functionality, I'd bet it's a very large number. Another fun one: Try to parse a URL and append extra query params to it, correctly.
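For the URL case, here is a minimal sketch assuming the runtime ships the WHATWG `URL` and `URLSearchParams` APIs (older browsers such as IE would still need a polyfill). `appendParams` is a hypothetical helper name, not something from a library.

```javascript
// Sketch only: append query parameters to an existing absolute URL.
// appendParams is a hypothetical helper; URL and URLSearchParams are assumed to exist.
function appendParams(urlString, params) {
  const url = new URL(urlString); // throws a TypeError if urlString is not a valid absolute URL
  for (const [key, value] of Object.entries(params)) {
    // append() keeps existing parameters and handles escaping of "&", "=", spaces, etc.
    url.searchParams.append(key, String(value));
  }
  return url.toString();
}

console.log(appendParams('https://example.com/search?q=a%20b#top', { page: 2, tag: 'x&y' }));
```

The existing query and the `#top` fragment survive, and the `&` inside the `tag` value is escaped rather than splitting the parameter.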


    'mañana'.padStart(7, '1') => "1mañana" // ok
    'man\u0303ana'.padStart(7, '1') => "mañana" // oops
You are being disingenuous here. Those are different strings, with different lengths (try copying this into the console):

    'mañana'.length // 6
    'mañana'.length // 7
The latter has two stacked characters. These issues are inherent to Unicode and `padStart` is treating the strings correctly. If you need normalization, use the .normalize method you mentioned yourself.

This is a major improvement: double-wide and stacked characters have been there since ES3, but now the language is providing standard tools to work with them.
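A small sketch of what "normalize first, then pad" looks like, assuming `String.prototype.normalize` and `padStart` are both available. `padStartNFC` is a made-up helper name for illustration.

```javascript
// Sketch only: normalize to NFC before padding so that composed and
// decomposed spellings of the same text pad identically.
// padStartNFC is a hypothetical helper, not a built-in.
function padStartNFC(str, targetLength, padString = ' ') {
  return str.normalize('NFC').padStart(targetLength, padString);
}

const composed = 'ma\u00F1ana';    // precomposed "ñ": 6 UTF-16 code units
const decomposed = 'man\u0303ana'; // "n" + combining tilde: 7 UTF-16 code units

console.log(composed.length, decomposed.length); // 6 7
console.log(padStartNFC(composed, 7, '1') === padStartNFC(decomposed, 7, '1')); // true
```

This only helps where a precomposed form exists; sequences with no single-code-point equivalent still count as several units after NFC.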

> If you need normalization, use the .normalize method you mentioned yourself.

If it were so simple...

`normalize` doesn't exist in IE at all and not in Safari < 10 so to take this advice we need a polyfill. As you may expect, polyfilling unicode normalization isn't pretty, it requires a massive lookup table.

The best polyfill out there, unorm, clocks in at ~38KB gzipped. Now, keep in mind there are a half dozen or more iframes on many web pages; each would have to load its own copy, and it's unlikely the caching would overlap, for a number of reasons. Also keep in mind that code builds / loading based on browser support isn't realistic in many cases, so if I want to use normalize, everyone pays the network bandwidth penalty, not just the IE11 users. Of course this is only one part of the problem. Want to iterate over grapheme clusters? That'll be another massive library. Etc, etc.
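For reference, grapheme cluster iteration can be sketched with `Intl.Segmenter`, on the assumption that the runtime provides it. It is much newer than the browsers discussed here, so it does not remove the polyfill problem for them. `graphemes` is a hypothetical helper name.

```javascript
// Sketch only: split a string into user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter exists in the runtime; graphemes is a hypothetical helper.
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

console.log(graphemes('man\u0303ana').length); // 6: "n" + combining tilde is one cluster
console.log(graphemes('\u{1F4A9}').length);    // 1
```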

The browser JS ecosystem is full of these problems, it's not just text processing. If you've ever wondered why a site needs to load 2MB of javascript, it's because that's about what is needed to create a cross browser compatibility layer and a reasonable standard library.

> Also keep in mind that code builds / loading based on browser support isn't realistic in many cases, so if I want to use normalize, everyone pays the network bandwidth usage penalty not just the IE11 users.

Switch to loading it via JS modules and using HTTP2 to keep connection lag low on cellular 3G connections? I agree, more needs to be done to promote these kinds of edge cases. A similar problem occurs with locale-aware date parsing and formatting.

Let's not talk about Javascript date/time.
I don't think that it's really fair to say that he's being disingenuous here. Regardless of the underlying byte form, the strings look indistinguishable, and users (devs) will often expect them to function as such.

I think it would be less confusing to define .length as the number of characters and have an additional .size method returning the number of bytes (I'm assuming that's what .length returns, if not it's even more confusing).

Of course, that already wasn't done - meh.

> have an additional .size method returning the number of bytes (I'm assuming that's what .length returns, if not it's even more confusing).

It's actually not the number of bytes, it's the number of...'codepoint pieces' is what it could be called I guess? Javascript's language level string implementation is something like UCS-2 with the addition of surrogate pairs being allowed, but counted as separate 'characters' for things like length and index access. It's some twisted middle ground between UCS-2 and UTF-16.
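To make the three different "lengths" concrete, here is a sketch that assumes `TextEncoder` is available for the UTF-8 byte count. `measure` is a hypothetical helper name.

```javascript
// Sketch only: three ways to count the "length" of a string.
// measure is a hypothetical helper; TextEncoder is assumed to exist.
function measure(str) {
  return {
    codeUnits: str.length,                           // UTF-16 code units, what .length reports
    codePoints: [...str].length,                     // the string iterator yields code points
    utf8Bytes: new TextEncoder().encode(str).length, // size when encoded as UTF-8
  };
}

console.log(measure('\u{1F4A9}'));    // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
console.log(measure('man\u0303ana')); // { codeUnits: 7, codePoints: 7, utf8Bytes: 8 }
```

None of the three matches the number of characters a reader sees for the second string, which is 6.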

That seems deranged to me. Like a true length calculation, it still requires a complex (albeit cacheable) calculation to resolve, but it fails to return the length of the string in terms of the number of characters as they would be naturally presented.

I understand a need in some contexts to distinguish between a character and its subsequent modifiers - but I do not see such a context here.

Design by committee?

> [...'man\u0303ana'].length => 7 // WTF? Why doesn't this match the behavior of [...'\u{1F4A9}'].length ?

The key thing to remember is that iteration over Unicode strings only makes sense as iteration over code points, not UCS-2 characters, not bytes, not grapheme clusters. The JS String iterator was very deliberately made to iterate over code points. That length reports UCS-2 characters is a historical mistake. That padding is operating on UCS-2 characters is probably a reflection of the fact that the operation isn't well-defined beyond ASCII.

> The key thing to remember is that iteration over Unicode strings only makes sense as iteration over code points, not UCS-2 characters, not bytes, not grapheme clusters.

There are tons of situations where iterating over grapheme clusters is what you want to do.

And tons of situations where you want neither of the two (e.g. NFD vs. NFC). The Cairo graphics library has utilities for text rendering, explicitly called "toy text" functions in its reference, leaving serious rendering to Pango. That's fair. Languages should not call unicode strings "unicode strings" if these are not covered in detail by special libraries with distinct names for ucp/ucs/etc lengths, iterators, etc. There is no such thing as string length or "char" anymore. String is blank or non-blank, anything beyond that is too complex to be part of any stdlib. Even "blank" is not so obvious today.
That's why I hated the "it might break stuff" arguments against making "string" interfaces against characters (including combinators), and always using UTF-8 for encoding internally, in memory. Would have made a lot of that easier.

As to your last bit, I tend to favor encodeURIComponent and have done it correctly... the main reason, is to avoid "+" vs " " in query strings.
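A rough sketch of that approach, building the query by hand with `encodeURIComponent` and comparing it with what `URLSearchParams` produces (assumed available). `buildQuery` is a hypothetical helper name.

```javascript
// Sketch only: build a query string with encodeURIComponent so that spaces
// become %20 rather than "+". buildQuery is a hypothetical helper.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join('&');
}

console.log(buildQuery({ q: 'a b+c' }));                     // q=a%20b%2Bc
console.log(new URLSearchParams({ q: 'a b+c' }).toString()); // q=a+b%2Bc (form encoding)
```

Both escape the literal `+`, but only the form encoding turns the space into `+`, which is where the ambiguity on the decoding side comes from.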

It seems like certain code points (like '\u{1F4A9}' aka the poop emoji) are a single character but the string reports a length of 2. That is the root of all of those problems. One of your "problems", the length of an array with a single string element, isn't a problem.
> One of your "problems", the length of an array with a single string element, isn't a problem.

You're misunderstanding the code; it's using the spread operator on a string:

[...'word'] => ["w", "o", "r", "d"]

What is being demonstrated is that under the hood, javascript stores astral plane codepoints as surrogate pairs and strings operate on 'characters' which is why '\u{1F4A9}'.length => 2. But, when the spread operator is applied to a string, it breaks up the string into codepoints, not characters. This is also why [...'man\u0303ana'].length => 7, the combining tilde is a separate codepoint.

This is an example of how wonky string processing in JS is: [...string].length is actually the most straightforward way to get a count of codepoints in a string.
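Building on that, a padding function that counts codepoints instead of UTF-16 units could look roughly like this. `padStartByCodePoint` is a hypothetical helper, and it still treats a combining mark as its own unit.

```javascript
// Sketch only: like padStart, but targetLength is measured in codepoints.
// padStartByCodePoint is a hypothetical helper, not a built-in.
function padStartByCodePoint(str, targetLength, padString = ' ') {
  const count = [...str].length;     // codepoints in the input
  const padUnits = [...padString];   // codepoints available for padding
  if (count >= targetLength || padUnits.length === 0) return str;
  let pad = '';
  for (let i = 0; i < targetLength - count; i++) {
    pad += padUnits[i % padUnits.length]; // repeat the pad string, never splitting a surrogate pair
  }
  return pad + str;
}

console.log([...padStartByCodePoint('\u{1F4A9}', 5, '1')].length); // 5
console.log('\u{1F4A9}'.padStart(5, '1').length);                  // 5 code units, but only 4 codepoints
```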

> One of your "problems", the length of an array with a single string element

That's not what it is. Look again, and pay attention to the ... part.

That's really helpful, thanks.
Any Unicode codepoint above U+FFFF requires two UTF-16 code units (a surrogate pair).
Holy fuck! The Fractal Of Bad Design all over again...