
	by pennig 5234 days ago
	The code path for that case must be delightful.

robin_reala 5234 days ago

Here's the diff:

https://hg.mozilla.org/mozilla-central/rev/bb53aec4a302

Doesn't look too bad

masklinn 5234 days ago

It's really weird that they just add special cases like that, though I expect that's only because they don't have enough special cases yet (they went from one, for the Turkish I, to two).

I'd have expected something like a generic Unicode-aware text management layer, with CSS text transforms simply going through that layer.

mcpherrinm 5233 days ago

The problem is that Unicode doesn't know about language. Unicode is just characters.

Language-aware bits are more gross, but then language often is. It's not nicely structured like most of the other things we encounter when transforming data.

masklinn 5232 days ago

> The problem is that Unicode doesn't know about language. Unicode is just characters.

I won't blame you for this, as it is a common mistake, but Unicode goes far beyond merely mapping characters to integers. The Standard Annexes, Technical Reports and Technical Specifications cover pretty much all things localization, from line breaking [UAX14] to regular expressions [UTS18], date and time formatting [UTS35] and sorting [UTS10].

And as it turns out, both uppercasing and titlecasing are covered by [UAX44] as part of the SpecialCasing.txt file, which provides lowercase, uppercase and titlecase mappings (along with optional conditions) for characters with non-trivial mappings. Trivial 1:1 mappings are covered in the base UnicodeData.txt file.

[UAX14] http://www.unicode.org/reports/tr14/

[UTS18] http://www.unicode.org/reports/tr18/

[UTS35] http://www.unicode.org/reports/tr35/

[UTS10] http://www.unicode.org/reports/tr10/

[UAX44] http://www.unicode.org/reports/tr44/
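
To make the distinction between trivial and non-trivial mappings concrete, here is a small Python sketch. It relies on Python 3's built-in `str.upper()` and `str.title()`, which apply the unconditional full case mappings; the helper name `is_special_cased` is made up for illustration, and none of this is how Firefox implements it.

```python
# Sketch: characters whose case mapping is not a simple 1:1 swap.
# Python's str.upper()/str.title() apply the unconditional full mappings
# (the kind SpecialCasing.txt lists), but none of the language-specific ones.

def is_special_cased(ch: str) -> bool:
    """Hypothetical helper: True if uppercasing one character yields several."""
    return len(ch) == 1 and len(ch.upper()) > 1

# German sharp s uppercases to two letters.
print("straße".upper())

# The "fi" ligature (U+FB01) expands to two separate letters.
print("\ufb01".upper())

# Titlecase can differ from uppercase: the digraph U+01C6 has a distinct
# titlecase form, U+01C5.
print("\u01c6".title() == "\u01c5")

print(is_special_cased("ß"), is_special_cased("a"))
```

The language-conditional entries (Turkish, Azeri, Lithuanian) are not applied here, because `str.upper()` takes no language argument.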

underwater 5234 days ago

Eek. Not only are they hardcoding the logic but they mix their CSS-specific code into the function. I understand that they are handling a limited number of cases now but if I came across that kind of code in my work I'd be very sceptical.

ars 5234 days ago

That's called not over-engineering something.

Making something more complicated doesn't make it better. Make it more complicated when you need to, not before.

darklajid 5234 days ago

I wonder how the German ß is handled. Having no clue about the implementation of these transforms, wouldn't that be a similar case?

lillycat 5234 days ago

Yes, in Firefox the 'eszett' (ß) is transformed into SS when uppercased, and this has been the case for a long time. The dotted and dotless Turkic i and the Dutch IJ are new in Firefox 14 (which is the first browser to support them, AFAIK).

There are also some specific cases with accented Greek diphthongs, where the diacritic position changes between upper and lower case, but Mozilla is working on a fix.
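
For anyone wondering what the Turkic and Dutch cases amount to, here is a rough Python sketch. The function names `upper_for_lang` and `capitalize_for_lang` and the language codes passed in are assumptions for illustration only; this is not the Firefox code from the diff, just the same idea written out.

```python
# Sketch of language-sensitive case transforms, assuming the caller
# knows the content language (e.g. from an HTML lang attribute).

def upper_for_lang(text: str, lang: str) -> str:
    """Uppercase text; in Turkish/Azeri, 'i' becomes dotted capital I (U+0130)."""
    if lang in ("tr", "az"):
        # Dotless i (U+0131) already maps to plain 'I' by default,
        # so only the dotted lowercase i needs special handling.
        text = text.replace("i", "\u0130")
    return text.upper()

def capitalize_for_lang(word: str, lang: str) -> str:
    """Uppercase the first letter; in Dutch, a leading 'ij' becomes 'IJ'."""
    if lang == "nl" and word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]

print(upper_for_lang("istanbul", "tr"))
print(upper_for_lang("istanbul", "en"))
print(capitalize_for_lang("ijsselmeer", "nl"))
print(capitalize_for_lang("ijsselmeer", "en"))
```

Both functions fall back to the generic mapping when the language is anything else, which is roughly why a couple of hardcoded special cases can be enough.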
