	by Kenji 3498 days ago
	Unicode URLs are the devil. Too many indistinguishable characters. URLs should stay full ASCII imho. And I say that as someone whose language requires non-ASCII symbols. Or, in Bruce Schneier's words: "Unicode is just too complex to ever be secure."


rurban 3498 days ago

But think about the poor underrepresented folks using foreign character sets?

You really need to support this in your friendly and social programming language: 'sub café {} café()' => Undefined subroutine café. Otherwise you will be accused of discrimination. Of course the two é are not normalized: they look identical but are different code point sequences.
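A minimal sketch of the normalization problem, in Python rather than the Perl of the snippet above. It assumes one é is the precomposed code point U+00E9 and the other is a plain e followed by U+0301; the helper name same_identifier is hypothetical, and the comparison uses the standard unicodedata module.

```python
import unicodedata

nfc = "caf\u00e9"    # é as one precomposed code point (U+00E9)
nfd = "cafe\u0301"   # plain e followed by COMBINING ACUTE ACCENT (U+0301)

def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two identifiers after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc == nfd)                 # False: the raw code point sequences differ
print(same_identifier(nfc, nfd))  # True: both normalize to the same string
```

A parser that normalizes identifiers before the symbol table lookup would treat both spellings as the same subroutine name.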

Which Unicode-friendly language really checks for mixed-script confusables? Only Java, is my guess. Even perl6 falls into this trap.

http://unicode.org/reports/tr39/#Mixed_Script_Confusables
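A rough sketch of what a mixed-script check on an identifier could look like. Python's standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's name, which is an assumption and far coarser than what the report linked above describes. The helper names rough_script, scripts_in and is_mixed_script are hypothetical.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Hypothetical helper: guess a script from the first word of the
    character name, e.g. 'LATIN SMALL LETTER A' -> 'LATIN'."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def scripts_in(identifier: str) -> set:
    """Collect the guessed scripts of all letters; digits, underscores
    and combining marks are ignored."""
    return {rough_script(ch) for ch in identifier if ch.isalpha()}

def is_mixed_script(identifier: str) -> bool:
    """Flag identifiers whose letters come from more than one script."""
    return len(scripts_in(identifier)) > 1

spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(is_mixed_script("paypal"))  # False: Latin only
print(is_mixed_script(spoofed))   # True: Latin plus Cyrillic
print(is_mixed_script("café"))    # False: accented Latin is still Latin
```

This only reports that scripts are mixed. It does not decide whether two particular characters are visually confusable, which needs the confusables data from the Unicode consortium.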


palunon 3498 days ago

When it is just accents, it's ok. But when your users have a language that uses a radically different alphabet, sometimes they can't even read Latin transliterations.


rurban 3494 days ago

Agree. But then you need to declare your exotic encoding somehow, such as in Perl via use encoding 'greek'; and then the parser does not need to guess about mixed-script encodings on every identifier. Only Latin and Greek are valid, everything else is invalid.

How many languages even check for mixed-script confusables? For dynamic languages this check is much too expensive, but they are leading the "good cause": allowing everything and checking nothing.
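The declared-scripts idea could be sketched like this: the source file states up front which scripts it uses, and any identifier containing a letter from another script is rejected. DECLARED and identifier_allowed are hypothetical names, and the script is again guessed from the first word of the Unicode character name, which is only an approximation of the real Script property.

```python
import unicodedata

# Assumption: the file declared Latin and Greek, as in the example above.
DECLARED = frozenset({"LATIN", "GREEK"})

def identifier_allowed(identifier: str, declared: frozenset = DECLARED) -> bool:
    """Hypothetical check: every letter must belong to a declared script."""
    for ch in identifier:
        if not ch.isalpha():
            continue  # digits, underscores and combining marks are skipped
        script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        if script not in declared:
            return False
    return True

print(identifier_allowed("café"))         # True: Latin
print(identifier_allowed("λ_max"))        # True: Greek and Latin, both declared
print(identifier_allowed("p\u0430ypal"))  # False: contains a Cyrillic letter
```

With a fixed whitelist the per-identifier work is one set lookup per character, so there is nothing to guess at parse time.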
