by rodelrod 1307 days ago
This space before the colon and other marks in French should actually be a narrow non-breaking space (U+202F) [0]. There's no key for it in the AZERTY layout.

This has been a problem since the typewriter age. People having to get on with their jobs coped with it by using a full, breaking em-space. Unless this gets replaced automatically by the word processor, you get horrid typography and misplaced line breaks all over the place.

The Académie Française should have dealt with this years ago, if their ass wasn't stuck in the 17th century.

[0] https://www.compart.com/en/unicode/U+202F

7 comments

A modern solution is simply to have non-breaking space easily accessible in your keyboard layout for when you need it. In the BÉPO layout this is at SHIFT + space. Especially simple since all double punctuations (:?!; those that require an nbsp before them) are also accessed with SHIFT.
Better still, why not abandon the space as a character, and render the colon with extra space to the left when locale is French?
Azerty layout could certainly map shift-space to unbreakable space too

But honestly, rather than changing keyboards (which is hard), why doesn't Google just pick a shorthand that doesn't break typography rules, like `@` instead of `:`?

The : allows users to discover it by coincidence when trying to type a smiley :)

If triggering it by accident wasn't intended, they could have used a regular keyboard shortcut like Alt+Shift+E or even instruct users to press Win+. to use the system-wide emoji menu.

A modern solution is to abandon old, no-longer-relevant typographic language rules, or to make typographic language rules context-specific.

But I agree that we need to make several alternative space characters easy to type:

  - non-breaking space (for this French rule)
  - wide space (for disambiguating sentence
    ending periods from non-sentence-ending
    periods)
  - zero-width non-breaking space (for
    preventing word-splitting?)
What makes the French rules old, no-longer-relevant, invalid? Why should we change language to appease lazy software developers?
Not "rules", just this rule. It has to do with typographic considerations that apply to old typesetting technologies that are no longer in use.

> Why should we change language to appease lazy software developers?

It's been done.

For example, in Spanish it is no longer the rule that "ch" and "ll" sort as if they were distinct letters (this change was made in 2010) precisely because that was such a difficult rule to implement. And that was a 256-year-old rule, per Wikipedia:

  The digraphs "ch" and "ll" were
  considered single letters of the
  alphabet from 1754 to 2010 (and
  sorted separately from "c" and "l"
  from 1803 to 1994).
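To give a feel for why that rule was awkward to implement, here is a minimal Python sketch of the old sort order. It assumes lower-cased, unaccented input, and the names `TRADITIONAL`, `RANK`, `TOKEN` and `traditional_key` are made up for illustration. It is not how any real collation library does it.

```python
import re

# Traditional Spanish alphabet order, with "ch" and "ll" counted as letters.
TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
               "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
               "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL)}

# Try the digraphs first, then fall back to any single character.
TOKEN = re.compile(r"ch|ll|.", re.DOTALL)

def traditional_key(word):
    """Sort key that treats 'ch' and 'll' as single letters.

    Simplification: characters outside the list (accented vowels,
    punctuation) all sort after 'z'.
    """
    return [RANK.get(tok, len(TRADITIONAL)) for tok in TOKEN.findall(word.lower())]

words = ["cura", "chico", "dama", "luz", "llama", "loma"]
modern = sorted(words)                            # plain code point order
traditional = sorted(words, key=traditional_key)  # digraph-aware order
```

With plain sorting "chico" lands before "cura" and "llama" before "loma"; with the digraph-aware key they move after every other "c" and "l" word. A simple per-character comparison cannot express that, which is the implementation burden being described.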
For another example, in Spanish capital letters were required to not carry accents, but now they are allowed to carry them. The old rule was due to overstriking on typewriters, which worked for accenting lower-case letters but not upper-case letters (the apostrophe would collide with the glyphs for upper-case vowels). But the technology to resolve this has existed in the Spanish-speaking world for a long time now, so the rule was finally dropped. (Not accenting upper-case letters can lead to ambiguities that are annoying.)

It's not just precedent. It's that the original reason for some typographic (not even orthographic) rule is simply not relevant in 2022.

And it's not unreasonable for either French people, non-French French speakers, or just non-French French-non-users to propose the ditching of hard-to-implement French rules. Now, this particular rule is decidedly not difficult to implement, but it is an annoying rule to apply as a user -- I should know, since I speak and write French (though I am not French).

Also, it doesn't matter what the French Academy says, or what the Spanish Royal Academy says, or what Webster's dictionary says, or whatever. Language evolves, even to their consternation. Moreover, developers don't have to care that much -- I18N/G11N is fun enough, and employers have to care for legal reasons, but rules like the Spanish ch/ll rule can be much too hard even for non-lazy developers, and the Royal Spanish Academy can and did have to change, and it was for the better.

In contrast, the Hungarian "cs", "dz", "dzs", "gy", "ny", "sz", "ty", and "zs" all remain distinct, as do their accented vowels. Polish is hybrid, considering digraphs as being 'composed' of single letters (i.e. 'sz' = 's+z'), rather than being distinct, but the accented characters are considered distinct from their unaccented cousins.

This problem has primarily come about because the Catholic church enforced the Latin alphabet on languages where a different alphabet might have been more appropriate. Spanish, although more closely related to Latin, still has a few sounds for which there's no good Latin character. There's no (particular) reason (as far as I'm aware) that 'ch' and 'll' became digraphs, while ñ acquired an accent.

Why should traditions change just because it's a bit more difficult to do things the 'old way'? why do we still bother with capital letters at the beginning of sentences? or speling things with two leters when one wil do, or riting silent leters wen you cant tell the difrens? & i dont think we need apostrofees n e more.

Indeed, there's no reason for 'ch' and 'll' to have been distinct digraphs in Spanish and ñ not to have been 'gn' as in other Romance languages. I'm not familiar with the whys of that.

I didn't know that Hungarian had a similar issue.

> Why should traditions change just because it's a bit more difficult to do things the 'old way'? why do we still bother with capital letters at the beginning of sentences? or speling things with two leters when one wil do, or riting silent leters wen you cant tell the difrens? & i dont think we need apostrofees n e more.

I distinguish typographic and orthographic rules. The non-breaking, thin space before punctuation rule is typographic and outdated (i.e., motivated by outdated typographic technology).

I do want some orthographic rules reformed too, but I'm more interested in the ones that are just hard. In particular I'm interested in collation reform because we do often have to collate multi-language text items but with one collation -this is especially true in databases- so having collations for Latin-script-using languages be similar is rather useful. This is also true given that I'm not going to be switching locales when I switch languages -- I speak, read, and write multiple languages, but I never ever change locales.

  >A modern solution is simply to have non-breaking space easily accessible in your keyboard layout 
An even better solution would be --for grown up people who have progressed beyond cave-painting and want to communicate using, you know, actual words-- to be able to disable emojis completely.

Fucking moronic shite that they are. I've seen people on Twatter and FB have entire conversations in bloody emojis. Talk about reverse evolution! Why don't we just go back to grunting and gesturing and have done with it?

I'm surprising myself by saying this, but emoji are pretty useful in certain contexts.

Certainly more useful than the requirement to put a space before a colon, non breaking or not.

I guess you only speak monotonically and avoid interpreting pitch, pauses, and physical cues in conversation as well, since those would be archaic, right?
smiley face. crying face. row of hearts in different colours. strange yellow thing that might be a banana or an arm flexing its bicep. sports trophy. winking face.

My god! --you're right. This is so much better and clearer than using those boring old fashioned words.

Emotions aren't clear either, but there are certain societal expectations to read and respond to them appropriately.
If you fail to put a space before a colon when writing in French, what happens next? Do people point and laugh? You get disciplined? Or would French speakers accept this as a better way to use the colon character?

It looks like this rule is based on old typographic considerations. Much like the Spanish Royal Academy's rule that capitalized letters carry no accents (unlike the opposite French rule that capitalized letters do carry accents!), which stems from typewriters not having accented letters, so one would type a vowel, backspace, then an apostrophe to make an accented vowel, but for capitals there's not enough space so you couldn't and wouldn't overstrike them.

Users and language academies should distinguish typographic from non-typographic language rules, and typographic rules should be context-specific (well technology-specific, since technology is the context).

I don't know, what would happen in English if you didn't capitalize the days of the week? Do people point and laugh? You get disciplined? Or would English speakers accept this as a better way to write the days of the week?

No human language on Earth is in a position where it can laugh at others for their idiomatisms.

I do not think that was their point (to laugh at other languages), but rather, that there may be contemporary situations which warrant rethinking the idioms of a language. As an aside, I do not think many people who speak english would even notice a lack of capitalization for the days of the week.
> As an aside, I do not think many people who speak english would even notice a lack of

English

I've had colleagues who refuse to use capitals in most cases. And yet their writing is completely comprehensible.
In English most people wouldn't care. You're going to get in more trouble in French schools for this than in U.S. schools (I don't know about UK schools, or elsewhere in the Anglosphere). My impression is that European culture is a lot more sensitive to these things than U.S. culture.
You'd certainly be corrected in the UK, and it would count as a regular spelling mistake.
Is Esperanto clear of exceptions and illogicalisms?
i never bother capitalising things. i was pulled up on it once, ever.

that person was wrong. :P

Or, word processors could understand that the pattern

"some-chars" + <whitespace> + ":"

must be treated as a single word in French.

(I guess it's more complicated than I imagine it is, alright)
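A minimal Python sketch of that pattern, assuming the text is already known to be French (working that out is a separate problem). It follows the convention cited elsewhere in this thread for France: NO-BREAK SPACE before ":" and NARROW NO-BREAK SPACE before ";", "!" and "?". The function name `french_spacing` is made up for illustration.

```python
import re

NBSP = "\u00a0"         # NO-BREAK SPACE
NARROW_NBSP = "\u202f"  # NARROW NO-BREAK SPACE

def french_spacing(text: str) -> str:
    """Swap ordinary spaces before French double punctuation for non-breaking ones."""
    # Plain spaces or tabs between a word and ":" become one NO-BREAK SPACE.
    text = re.sub(r"(?<=\S)[ \t]+(?=:)", NBSP, text)
    # Plain spaces or tabs before ";", "!" or "?" become one NARROW NO-BREAK SPACE.
    text = re.sub(r"(?<=\S)[ \t]+(?=[;!?])", NARROW_NBSP, text)
    return text

fixed = french_spacing("Attention : danger !")
```

This only rewrites spaces that are already there, so it glues the punctuation to the preceding word without inventing spacing the writer did not type. It would also fire on a typed smiley such as "ok :)", which is the same ambiguity the emoji picker runs into.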

It's more complicated than you imagine it is. Basically you need to know both whether the current locale is a French locale, and whether the original text was written in a French locale.

The former is easy enough, but also very annoying to multilingual people since one might run in a Spanish locale but occasionally write in French. So that's not a solution.

The latter is... hard to do, because while Unicode has language tags that you can embed in documents, those are deprecated and they were never well supported, and so there's no way to mark-up text as being in one language or another, and a document-wide setting wouldn't be enough nor sufficiently generic and standard and portable.

The best solution here is to relax the French typographic rule (since it isn't needed anymore). But that would take time to filter through to French speakers (writers, and readers) so that they learn to not put that pesky space before punctuation, but also so that they don't complain when it's missing.

Or... you know, this business of emoji pickers could be something you could turn off. Nahhh, that would never fly! (/s)

> Unicode has language tags that you can embed in documents, those are deprecated and they were never well supported,

Huh? MDN doesn’t mention this … why are they deprecated?

Yes, [sadly] deprecated by the Unicode Consortium in Unicode 5.2, and by the IETF in RFC 6082. Nor have they been undeprecated since then (see https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G115...).

IMO we should have tried harder to make Unicode language tags useful and used. But it didn't happen, so they're a thing of the past. Of course, they're still there, and one could attempt to resurrect them, but most likely one would fail.

Choice quotes below:

https://www.rfc-editor.org/rfc/rfc6082

  > RFC 2482, "Language Tagging in Unicode Plain
  > Text" [RFC2482], describes a mechanism
  > for using special Unicode language tag
  > characters to identify languages when needed.
  > It is an idea whose time never quite came.
  > It has been superseded by whole-transaction
  > language identification such as the MIME
  > Content-language header [RFC3282] and more
  > general markup mechanisms such as those
  > provided by XML.  The Unicode Consortium
  > has deprecated the language tag character
  > facility and strongly recommends against
  > its use.  RFC 2482 has been moved to
  > Historic status to reduce the possibility
  > that Internet implementers would consider
  > that tagging system an appropriate mechanism
  > for identifying languages.
  >
  > A discussion of the status of the language tag
  > characters and their applicability appears
  > in Section 16.9 of The Unicode Standard
  > [Unicode52].
https://www.unicode.org/versions/Unicode5.2.0/ch16.pdf (section 9 of that chapter, 16)

  > 16.9 Deprecated Tag Characters
  > Deprecated Tag Characters: U+E0000–U+E007F
  > The characters in this block provide a
  > mechanism for language tagging in Unicode
  > plain text. These characters are deprecated,
  > and should not be used—particularly with any
  > protocols that provide alternate means of
  > language tagging. The Unicode Standard
  > recommends the use of higher-level protocols,
  > such as HTML or XML, which provide for language
  > tagging via markup. See Unicode Technical
  > Report #20, “Unicode in XML and Other Markup
  > Languages.” The requirement for language
  > information embedded in plain text data is
  > often overstated, and markup or other rich text
  > mechanisms constitute best current practice.
  > See Section 5.10, Language Information in
  > Plain Text for further discussion.
(Reformatting is mine.)
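For anyone curious what those characters look like in practice, here is a rough Python sketch based on the block range quoted above (U+E0000–U+E007F). U+E0001 is LANGUAGE TAG, and the characters after it mirror ASCII at an offset of 0xE0000. The function names are made up for illustration, and given the deprecation this is more useful for spotting or stripping such characters than for producing them.

```python
TAG_BLOCK = range(0xE0000, 0xE0080)  # U+E0000–U+E007F
LANGUAGE_TAG = "\U000E0001"          # U+E0001 LANGUAGE TAG

def encode_language_tag(tag: str) -> str:
    """Spell an ASCII language tag such as 'fr' in tag characters."""
    if not tag.isascii():
        raise ValueError("language tags are ASCII only")
    # Each ASCII character maps to the tag character at 0xE0000 + its code.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in tag)

def strip_tag_characters(text: str) -> str:
    """Drop every character that falls in the tag block."""
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)

tagged = encode_language_tag("fr") + "Bonjour"
```

The tagged string contains three invisible code points ahead of "Bonjour", which hints at why plain-text tagging was fragile: any tool unaware of the block either passes them through unseen or mangles them.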
Oh, “tag” made my brain parse as HTML tags. TIL
And thus impossible to use the emoji inserting feature, and other bugs (silently replacing whitespace with different but identical-looking whitespace will make for fun issues)
Well, double colon :: is just as convenient, makes more sense, and is not a problem in either French or English. Even in English, you might want to stick an emoji to the previous word. Like Hell::devil:: (my nephew would like that though).
Double colon would make talking about perl even worse (:: is namespace separator) -- it's already bad enough that any namespace starting with D becomes an emoticon in pretty much all online editors.

This stuff should just stop. Operating Systems / Browser vendors should instead standardize on a hotkey to bring up an emoji selector that steals focus to filter via typing and inserts on enter key.

Code should be quoted upfront.

Finding some sequence that also isn’t a valid Perl substring is impossible, in any case.

Indeed. Replace scroll lock with emoji lock
MS Word actually replaces normal whitespaces with non breaking spaces in some cases. Kinda works.
LibreOffice, and I believe MS Word too, automatically replaces the space before a colon with a non-breaking one
This sounds like it's not actually true? If it was, the French code page 646 that we used until Unicode finally won would have included a narrow space, but it doesn't. "Regular" computer text in French has only ever used a normal space, even if handwriting and/or "true" typesetting using typesetting solutions like TeX or PageMaker etc. allowed for a narrow space.
FWIW, LibreOffice automatically inserts an actual Unicode NO-BREAK SPACE when I type ":" at the end of a word (if the language is set for French of course). If I insert an actual SPACE and then hit ":", it even replaces the SPACE with a NO-BREAK SPACE.

I'd be surprised if MS Word didn't do the same. No need for a "true" typesetting solution.

I think the point of the GP is that it inserts NO-BREAK SPACE instead of a NARROW NO-BREAK SPACE. If that's the case, it's a bug.
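Since the two look alike on screen, a quick way to check which one a word processor actually inserted is to ask for the character names. A small Python sketch using the standard `unicodedata` module; the helper name `describe_spaces` is made up for illustration.

```python
import unicodedata

# The three characters being confused in this thread.
SPACES = [" ", "\u00a0", "\u202f"]

def describe_spaces(text):
    """Return (code point, Unicode name) for each space-like character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text if ch in SPACES]

report = describe_spaces("a b\u00a0c\u202fd")
```

Pasting a sentence copied from the document into `describe_spaces` shows whether the space before the punctuation is U+00A0 or U+202F.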
Codepages are from back in the era of little memory available and monospaced fonts.

And another "US English" centered thing; spacing does not really matter in English, but can have functional differences in other languages and scripts.

https://www.youtube.com/watch?v=2yWWFLI5kFU is a fun look at one of the problems with Unicode in general.

This? https://en.wikipedia.org/wiki/Code_page_1010 There’s no space for it nor for many other more useful characters (like â).
This is mostly correct, but I don't see how it contradicts my statement.

Did the 646 standard account for variable-width characters at all?

Not so much contradict as wondering about the claim that it should be a specific Unicode codepoint when Unicode wasn't around when we started "computering" text (and the Académie Française can't have possibly formally declared things in terms of Unicode =)

What are the actual official rules in this case (and are there links to those? Because that'd be fascinating information to read through)?

Best reference I could find is here: https://www.lalanguefrancaise.com/articles/espace-insecable

Actually before the ":" specifically there should be a regular non-breaking space, not a narrow one. Except in Switzerland. Other punctuation marks take the narrow non-breaking space.

Love how it's "recommandé", "pour des raisons esthétiques."

Which I guess means we're completely free to ignore it. Pour mêmes raisons =P

> Especially simple since all double punctuations (:?!; those that require an nbsp before them) are also accessed with SHIFT.

I just tested with `setxkmap fr`. These are not shifted:

    :!;
Only this requires Shift:

    ?
Also French layouts use an inverted number row (although none of those are accessed through that row).
Correction: `setxkbmap fr`.
This is partially false, because there isn't one true AZERTY layout. There are various platform implementations, with MS Windows being the most common.

In fact, the French standards body AFNOR updated its AZERTY layout standard three years ago to include more characters, including the narrow non-breaking space. In traditional ISO-like fashion, one must pay to access this standard, but you can find an example here:

https://commons.wikimedia.org/wiki/File:KB_-_AZERTY_-_AFNOR....

It's mapped to AltGr + Maj + Space. Now you just need to find how to install/enable this layout on the platforms you care about.

CORRECTION: narrow non-breaking space goes before ";", "!" and "?". Before ":" you should use regular non-breaking space. That is, in France. In French-speaking Switzerland it's a narrow non-breaking space everywhere.

This is the best reference I could find: https://www.lalanguefrancaise.com/articles/espace-insecable