| HN Mirror


by ricardobeat 3267 days ago

    'mañana'.padStart(7, '1') => "1mañana" // ok
    'man\u0303ana'.padStart(7, '1') => "mañana" // oops

You are being disingenuous here. Those are different strings, with different lengths (try copying this into the console):

    'ma\u00f1ana'.length // 6
    'man\u0303ana'.length // 7

The latter has two stacked characters. These issues are inherent to Unicode and `padStart` is treating the strings correctly. If you need normalization, use the .normalize method you mentioned yourself.

This is a major improvement: double-wide and stacked characters have been there since ES3, but now the language is providing standard tools to work with them.
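As a rough illustration of the advice above, here is a minimal sketch that normalizes to NFC before padding. The helper name `padStartNormalized` is hypothetical (it is not a standard API), and the sketch assumes an engine that ships both `String.prototype.normalize` and `String.prototype.padStart`.

```javascript
// Hypothetical helper: compose to NFC first, then pad.
function padStartNormalized(str, targetLength, padString) {
  return str.normalize('NFC').padStart(targetLength, padString);
}

const composed = 'ma\u00f1ana';    // "ñ" stored as the single code unit U+00F1
const decomposed = 'man\u0303ana'; // "n" followed by U+0303 COMBINING TILDE

console.log(composed.length);   // 6
console.log(decomposed.length); // 7

// Already 7 code units long, so padStart adds nothing.
console.log(decomposed.padStart(7, '1') === decomposed); // true

// After NFC normalization the string is 6 units long and gets one pad character.
console.log(padStartNormalized(decomposed, 7, '1') === '1' + composed); // true
```

Note that NFC only helps where a precomposed form exists; sequences with no precomposed equivalent keep their combining marks and are still counted separately.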

2 comments

mbell 3267 days ago

> If you need normalization, use the .normalize method you mentioned yourself.

If it were so simple...

`normalize` doesn't exist in IE at all, nor in Safari < 10, so to take this advice we need a polyfill. As you might expect, polyfilling Unicode normalization isn't pretty: it requires a massive lookup table.

The best polyfill out there, unorm, clocks in at ~38KB gzipped. Now, keep in mind there are a half dozen or more iframes on many web pages; each would have to load its own copy, and it's unlikely the caching would overlap, for a number of reasons. Also keep in mind that code builds / loading based on browser support isn't realistic in many cases, so if I want to use normalize, everyone pays the network bandwidth usage penalty not just the IE11 users. Of course, this is only one part of the problem. Want to iterate over grapheme clusters? That'll be another massive library. Etc., etc.

The browser JS ecosystem is full of these problems; it's not just text processing. If you've ever wondered why a site needs to load 2MB of JavaScript, it's because that's about what is needed to create a cross-browser compatibility layer and a reasonable standard library.

lstamour 3267 days ago

> Also keep in mind that code builds / loading based on browser support isn't realistic in many cases, so if I want to use normalize, everyone pays the network bandwidth usage penalty not just the IE11 users.

Switch to loading it via JS modules and using HTTP/2 to keep connection lag low on cellular 3G connections? I agree, more needs to be done to promote these kinds of edge cases. A similar problem occurs with locale-aware date parsing and formatting.
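For what it's worth, the conditional-loading idea could look something like the sketch below. Both function names are hypothetical, and `loadPolyfill` is an assumed callback supplied by the caller (for example a dynamic `import()` of whatever polyfill module the site uses). It only shows the mechanism; it does not address the iframe, caching, or build constraints raised in the parent comment.

```javascript
// Hypothetical feature check: true when the engine lacks String.prototype.normalize.
function needsNormalizePolyfill() {
  return typeof String.prototype.normalize !== 'function';
}

// Hypothetical loader wrapper. `loadPolyfill` is an assumed caller-supplied
// async function, e.g. () => import('./normalize-polyfill.js').
async function ensureNormalize(loadPolyfill) {
  if (needsNormalizePolyfill()) {
    await loadPolyfill();
    return true;  // the loader was invoked
  }
  return false;   // native support, nothing downloaded
}

// Usage with a stub loader that only records whether it ran.
let loaderRan = false;
ensureNormalize(async () => { loaderRan = true; })
  .then((loaded) => console.log({ loaded, loaderRan }));
```

With this shape, only engines that fail the check pay for the download.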

lilbobbytables 3267 days ago

Let's not talk about JavaScript date/time.

yarg 3267 days ago

I don't think it's really fair to say that he's being disingenuous here. Regardless of the underlying byte form, the strings look indistinguishable, and users (devs) will often expect them to behave as such.

I think it would be less confusing to define .length as the number of characters and have an additional .size method returning the number of bytes (I'm assuming that's what .length returns, if not it's even more confusing).

Of course, that already wasn't done - meh.

mbell 3267 days ago

> have an additional .size method returning the number of bytes (I'm assuming that's what .length returns, if not it's even more confusing).

It's actually not the number of bytes, it's the number of... 'codepoint pieces' is what it could be called, I guess? JavaScript's language-level string implementation is something like UCS-2 with the addition of surrogate pairs being allowed, but counted as separate 'characters' for things like length and index access. It's some twisted middle ground between UCS-2 and UTF-16.
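A small sketch may make the distinction concrete. The 'pieces' described above are UTF-16 code units: `.length` counts those, string iteration walks code points, and the byte count depends on the encoding. The helper name `measure` is made up for illustration, and the sketch assumes an environment with a global `TextEncoder` (modern browsers and Node).

```javascript
// Hypothetical helper that reports three different "lengths" of a string.
function measure(str) {
  return {
    codeUnits: str.length,                           // UTF-16 code units
    codePoints: Array.from(str).length,              // string iteration yields code points
    utf8Bytes: new TextEncoder().encode(str).length, // bytes when encoded as UTF-8
  };
}

// "a" followed by U+1F600, which lies outside the BMP and needs a surrogate pair.
console.log(measure('a\u{1F600}'));   // { codeUnits: 3, codePoints: 2, utf8Bytes: 5 }

// Decomposed "mañana": the combining tilde is its own code point.
console.log(measure('man\u0303ana')); // { codeUnits: 7, codePoints: 7, utf8Bytes: 8 }

// Index access sees the two surrogate halves separately.
console.log('\u{1F600}'.charCodeAt(0).toString(16));   // "d83d"
console.log('\u{1F600}'.charCodeAt(1).toString(16));   // "de00"
console.log('\u{1F600}'.codePointAt(0).toString(16));  // "1f600"
```

None of the three counts matches what a reader sees on screen for the decomposed string, which renders as six characters.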

yarg 3266 days ago

That seems deranged to me. Like a true length calculation, it still requires a complex (albeit cacheable) calculation to resolve, but it fails to return the length of the string in terms of the number of characters as they would be naturally presented.

I understand a need in some contexts to distinguish between a character and its subsequent modifiers - but I do not see such a context here.

Design by committee?
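For counting characters "as they would be naturally presented", a sketch using `Intl.Segmenter` follows. That API postdates this thread and is not available in every engine, so treat its presence as an assumption; `countGraphemes` is a hypothetical helper name, and the `'en'` default locale is an arbitrary choice.

```javascript
// Hypothetical helper: counts grapheme clusters.
// Assumes the engine provides Intl.Segmenter.
function countGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count += 1;
  }
  return count;
}

if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
  console.log(countGraphemes('ma\u00f1ana'));  // 6
  console.log(countGraphemes('man\u0303ana')); // 6, although .length is 7
}
```

Unlike `.length`, this walks the whole string on every call, so the count costs time proportional to the string's size.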
