by sdeframond 1355 days ago

Or, word processors could understand that the pattern

"some-chars" + <whitespace> + ":"

must be treated as a single word in French.

(I guess it's more complicated than I imagine it is, alright)
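
To make the proposed pattern concrete, here is a minimal sketch of a tokenizer that keeps "word + whitespace + colon" together as one token. The `TOKEN` pattern and the `tokenize` helper are hypothetical names for illustration, not how any real word processor does it, and the rule assumes the colon is followed by whitespace or the end of the text.

```python
import re

# Hypothetical rule: a run of non-space characters, then ordinary or
# no-break spaces, then a colon that is NOT glued to a following word,
# is one token. Everything else falls back to whitespace splitting.
TOKEN = re.compile(r"\S+[ \t\u00a0\u202f]+:(?!\S)|\S+")

def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping 'word :' together."""
    return TOKEN.findall(text)

print(tokenize("Attention : danger"))
print(tokenize("Note :smile:"))
```

The negative lookahead is what keeps a shortcode such as `:smile:` from being swallowed into the previous word, which is exactly the ambiguity the thread is about.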

3 comments

cryptonector 1354 days ago

It's more complicated than you imagine it is. Basically you need to know both whether the current locale is a French locale and whether the original text was written in a French locale.

The former is easy enough, but also very annoying to multilingual people since one might run in a Spanish locale but occasionally write in French. So that's not a solution.

The latter is... hard to do. Unicode has language tags that you can embed in documents, but those are deprecated and were never well supported, so there's no way to mark up text as being in one language or another. A document-wide setting wouldn't be enough, nor sufficiently generic, standard, and portable.

The best solution here is to relax the French typographic rule (since it isn't needed anymore). But that would take time to filter through to French speakers (writers, and readers) so that they learn to not put that pesky space before punctuation, but also so that they don't complain when it's missing.

Or... you know, this business of emoji pickers could be something you could turn off. Nahhh, that would never fly! (/s)

gnubison 1354 days ago

> Unicode has language tags that you can embed in documents, those are deprecated and they were never well supported,

Huh? MDN doesn’t mention this … why are they deprecated?

cryptonector 1354 days ago

Yes, [sadly] deprecated by the Unicode Consortium in Unicode 5.2, and by the IETF in RFC 6082. Nor have they been undeprecated since then (see https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G115...).

IMO we should have tried harder to make Unicode language tags useful and used. But it didn't happen, so they're a thing of the past. Of course, they're still there, and one could attempt to resurrect them, but most likely one would fail.

Choice quotes below:

https://www.rfc-editor.org/rfc/rfc6082

  > RFC 2482, "Language Tagging in Unicode Plain
  > Text" [RFC2482], describes a mechanism
  > for using special Unicode language tag
  > characters to identify languages when needed.
  > It is an idea whose time never quite came.
  > It has been superseded by whole-transaction
  > language identification such as the MIME
  > Content-language header [RFC3282] and more
  > general markup mechanisms such as those
  > provided by XML.  The Unicode Consortium
  > has deprecated the language tag character
  > facility and strongly recommends against
  > its use.  RFC 2482 has been moved to
  > Historic status to reduce the possibility
  > that Internet implementers would consider
  > that tagging system an appropriate mechanism
  > for identifying languages.
  >
  > A discussion of the status of the language tag
  > characters and their applicability appears
  > in Section 16.9 of The Unicode Standard
  > [Unicode52].

https://www.unicode.org/versions/Unicode5.2.0/ch16.pdf (section 9 of that chapter, 16)

  > 16.9 Deprecated Tag Characters
  >
  > Deprecated Tag Characters: U+E0000–U+E007F
  >
  > The characters in this block provide a
  > mechanism for language tagging in Unicode
  > plain text. These characters are deprecated,
  > and should not be used—particularly with any
  > protocols that provide alternate means of
  > language tagging. The Unicode Standard
  > recommends the use of higher-level protocols,
  > such as HTML or XML, which provide for language
  > tagging via markup. See Unicode Technical
  > Report #20, “Unicode in XML and Other Markup
  > Languages.” The requirement for language
  > information embedded in plain text data is
  > often overstated, and markup or other rich
  > text mechanisms constitute best current
  > practice. See Section 5.10, Language
  > Information in Plain Text for further
  > discussion.

(Reformatting is mine.)
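
For anyone who has never seen these characters, here is a small sketch of what a plain-text language tag looks like, based on my reading of RFC 2482: U+E0001 LANGUAGE TAG followed by tag characters that mirror printable ASCII at an offset of 0xE0000. The function names are made up for illustration, and given the deprecation above this is for inspecting or cleaning old text, not for producing new tagged text.

```python
import re

TAG_OFFSET = 0xE0000            # tag characters mirror ASCII at this offset
LANGUAGE_TAG = "\U000E0001"     # U+E0001 LANGUAGE TAG (deprecated)

def encode_language_tag(lang: str) -> str:
    """Build the tag-character sequence for an ASCII tag such as 'fr'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)

# U+E0001 followed by any run of tag characters U+E0020..U+E007E.
LANGUAGE_TAG_RUN = re.compile("\U000E0001[\U000E0020-\U000E007E]*")

def strip_language_tags(text: str) -> str:
    """Remove language tag sequences, leaving the visible text alone."""
    return LANGUAGE_TAG_RUN.sub("", text)

tagged = encode_language_tag("fr") + "Attention : danger"
print([f"U+{ord(c):04X}" for c in tagged[:3]])
print(strip_language_tags(tagged))
```

The stripping pattern only removes runs that start with U+E0001, so other uses of characters in the same block are left untouched.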

gnubison 1353 days ago

Oh, “tag” made my brain parse as HTML tags. TIL

makapuf 1355 days ago

And thus it would be impossible to use the emoji-inserting feature, and there would be other bugs (silently replacing whitespace with different but identical-looking whitespace will make for fun issues).

c80e74f077 1355 days ago

Well, a double colon :: is just as convenient, makes more sense, and is not a problem in either French or English. Even in English, you might want to stick an emoji to the previous word. Like Hell::devil:: (my nephew would like that though).

troglodynellc 1354 days ago

Double colon would make talking about Perl even worse (:: is the namespace separator) -- it's already bad enough that any namespace starting with D becomes an emoticon in pretty much all online editors.

This stuff should just stop. Operating system and browser vendors should instead standardize on a hotkey that brings up an emoji selector, which steals focus, filters as you type, and inserts on the Enter key.
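
The Perl namespace problem above is easy to reproduce. Below is a rough sketch, assuming an editor that converts the `:D` emoticon by plain substring replacement; both function names and the single-emoticon setup are hypothetical, and the "bounded" variant is only one possible mitigation.

```python
import re

GRIN = "\U0001F600"  # the emoji a hypothetical editor substitutes for ":D"

def naive_emoticons(text: str) -> str:
    """Replace every ':D' substring, wherever it occurs."""
    return text.replace(":D", GRIN)

# Only convert ':D' when it stands alone between whitespace or text edges.
BOUNDED = re.compile(r"(?<!\S):D(?!\S)")

def bounded_emoticons(text: str) -> str:
    """Replace ':D' only when it is a standalone token."""
    return BOUNDED.sub(GRIN, text)

print(naive_emoticons("use Data::Dumper;"))    # the namespace gets mangled
print(bounded_emoticons("use Data::Dumper;"))  # left untouched
print(bounded_emoticons("nice :D"))            # still converted
```

Requiring whitespace on both sides helps with this particular case, but it does not remove the underlying ambiguity between text and input shortcuts.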

keybored 1354 days ago

Code should be quoted upfront.

Finding some sequence that also isn’t a valid Perl substring is impossible, in any case.

makapuf 1354 days ago

Indeed. Replace scroll lock with emoji lock

sdeframond 1354 days ago

MS Word actually replaces normal spaces with non-breaking spaces in some cases. Kinda works.

nephanth 1354 days ago

LibreOffice, and I believe MS Word too, automatically replaces the space before a colon with a non-breaking one.
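
As a rough illustration of that kind of autocorrect (not the actual LibreOffice or Word logic, which I have not looked at), a substitution along these lines swaps the ordinary space before a colon for U+00A0 NO-BREAK SPACE. The pattern and the `protect_colon_space` name are assumptions for the sketch.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

# Ordinary spaces right after a word and right before a colon that is
# itself followed by whitespace or the end of the text.
SPACE_BEFORE_COLON = re.compile(r"(?<=\S) +:(?=\s|$)")

def protect_colon_space(text: str) -> str:
    """Replace the space(s) before a word-ending colon with one NBSP."""
    return SPACE_BEFORE_COLON.sub(NBSP + ":", text)

print(repr(protect_colon_space("Attention : danger")))
print(repr(protect_colon_space("see :smile:")))
```

Because the result looks identical on screen, this is also the "different but same-looking whitespace" that was warned about earlier in the thread.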
