by msbarnett 1442 days ago
> So let me just first say that just sending a CSV/JSON/YAML/whatever file to professional translators and expect good results back is just not going to work. We've done that and sometimes the context is just horribly wrong. The only way to get good results is for the translators to actually see the UI or even better run the app themselves.

You give them some context, and let them ask you questions if they feel things are too ambiguous for them to produce an accurate translation for the context it will be used in. In some cases we will include a screenshot of the rendered English page/component/etc so that the translator can map the key values they're seeing to the presentation context.

I can only tell you that this process has scaled to 10s of millions in sales in foreign languages, and that the translation services we use absolutely do not have any time or interest in signing additional NDAs around source code, in getting their employees set up with bespoke code and dev environments, etc. It would be a gigantic drag on their business model.

> I absolutely think the right way is to have translations be HTML fragments.

These translators do not know HTML and are not going to be able to work with it in any way – again, this would require the services to totally overhaul their business model, and spend a bunch of money/time on training or hiring more specialized translators with HTML/CSS skills, which they have no interest in doing.

It would also open up a threat model that's currently non-existent for us. Total non-starter.

> How else would you know what part of the sentence should be italic or contain a hyperlink?

Translation keys contain a simple substitutional form that can be replaced on key lookup, so

     some.introductory.paragraph: Call to action: %{click_here}
     a.fancy.link.name: Click to purchase!
in code:

     t('.some.introductory.paragraph', click_here: link(target_url, t('.a.fancy.link.name')))
The developer can inject formatting that way if necessary, etc, although generally speaking this is a really rare use-case in my experience: randomly italicizing or bolding or otherwise styling words in a paragraph looks fairly unprofessional/isn't typically done.
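To make the substitution step concrete, here is a minimal Ruby sketch of that lookup. The in-memory `TRANSLATIONS` table, the full (non-relative) key names and the stand-in `link` helper are assumptions for illustration, not our actual code, and escaping is deliberately left out so that only the `%{...}` replacement is visible.

```ruby
# Hypothetical in-memory translation table; a real app would load this from files.
TRANSLATIONS = {
  "some.introductory.paragraph" => "Call to action: %{click_here}",
  "a.fancy.link.name"           => "Click to purchase!"
}.freeze

# Look up a key and replace each %{name} placeholder with the matching value.
def t(key, **values)
  TRANSLATIONS.fetch(key).gsub(/%\{(\w+)\}/) do
    values.fetch(Regexp.last_match(1).to_sym).to_s
  end
end

# Stand-in for the `link` helper; it does no escaping in this sketch.
def link(url, text)
  %(<a href="#{url}">#{text}</a>)
end

puts t("some.introductory.paragraph",
       click_here: link("https://example.com/buy", t("a.fancy.link.name")))
# prints: Call to action: <a href="https://example.com/buy">Click to purchase!</a>
```

The translator only ever sees the two plain sentences; the anchor tag exists only in developer code.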

My experience is that most translators actually do know basic HTML or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I'd argue that one reason rich text translations are rare is because it's such a pain. Just look at any static documentation web site -- styling and links are everywhere. Of course I want that for non-static web sites/apps as well; links to navigate the app, side-bars or popovers with help text and documentation, more links, bulleted lists ...

I'm not sure I understand how your example prevents that HTML threat model you mention, unless the "link" function generates some kind of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).

> My experience is that most translators actually do know basic HTML or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I don't know what to tell you other than that it is not my experience at all that translation services offer or even accept HTML as a source-format, and if they did they would no doubt command a significant premium over translators who know the languages but lack such tech skills.

And I absolutely wouldn't trust a third party to directly author HTML we were serving anyway. Manual audits of 3rd-party input aren't enough: your tooling should be automatically protecting you from 3rd parties inserting unsanitized HTML (as below).

> I'm not sure I understand how your example prevents that HTML threat model you mention, unless the "link" function generates some kind of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway

Good lord, no – you should never be rendering externally controlled strings directly as unescaped HTML, and that includes strings from 3rd party translators.

The lookup function for translation keys produces instances of an "unsanitized" (tainted) string class, which is escaped on rendering. Suppose "link" takes two arguments: the URL, which becomes the href, and the text that gets wrapped in the A tag. The text argument is completely escaped, so attempting to embed HTML in the translation key .a.fancy.link.name results in mangled output, e.g.:

translation file:

    a.fancy.link.name: <script src="some-evil-bitcoin-miner-script.js"></script>Click Here!
HTML template:

    <%= link(foo_service_url, t('.a.fancy.link.name')) %>
would produce the final HTML:

    <a href="http://www.foo.com">&lt;script src=&quot;some-evil-bitcoin-miner-script.js&quot;&gt;&lt;/script&gt;Click Here!</a>
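For anyone who wants to see the mechanism, this is a rough Ruby sketch of escape-on-render using only the standard library. It is not our production code: the `TrustedHtml` wrapper, the `render` function and the hard-coded table are hypothetical, and the sketch marks the trusted side rather than the tainted side (everything unwrapped counts as tainted), which gives the same effect with less machinery.

```ruby
require "cgi"

# Hypothetical wrapper for markup the developers wrote themselves.
TrustedHtml = Struct.new(:html)

# Trusted markup passes through; anything else is treated as tainted and escaped.
def render(value)
  value.is_a?(TrustedHtml) ? value.html : CGI.escapeHTML(value.to_s)
end

# Hypothetical translation table; lookups return plain, untrusted strings.
TRANSLATIONS = {
  "a.fancy.link.name" =>
    '<script src="some-evil-bitcoin-miner-script.js"></script>Click Here!'
}.freeze

def t(key)
  TRANSLATIONS.fetch(key)
end

# Developer-controlled helper: escapes its inputs, then returns trusted markup.
def link(url, text)
  TrustedHtml.new(%(<a href="#{CGI.escapeHTML(url)}">#{render(text)}</a>))
end

puts render(link("http://www.foo.com", t("a.fancy.link.name")))
# prints: <a href="http://www.foo.com">&lt;script src=&quot;some-evil-bitcoin-miner-script.js&quot;&gt;&lt;/script&gt;Click Here!</a>
```

The only way to get raw markup out is to construct a `TrustedHtml`, and that happens in developer code, never in the translation file.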
> (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).

That's... hardly the only threat you face if you have a translator feeding you malicious strings.

We're not talking full HTML documents here, just strings with an occasional link or word styling or maybe once in a while a bulleted list. But your experience differs from mine then.

Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

> Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

Well, the differentiation is between two kinds of strings. Dev-authored strings, which are parsed at compilation/boot time, are trusted (untainted) and thus may contain HTML that's rendered directly. Tainted strings, which come in at runtime from user input or via the translation lookup (among other things), can never be rendered without being fully HTML-escaped. The code could explicitly untaint them, but that would never survive code review because it's profoundly unsafe.

click_here isn't a "real" variable in the source language; it's just something that the translation API can replace during the translation load. To the extent that it can contain HTML, it can do so if and only if it is bound to an untainted string instance during the translation load. Binding it to a tainted instance would cause any HTML that gets inserted there to be fully escaped. "link", being dev-controlled, produces untainted strings, but might itself consume a tainted string for its title (and thus escape that while rendering the title as part of outputting its untainted string), etc.
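A small Ruby sketch of that binding rule, under stated assumptions: `TrustedHtml`, `render` and `interpolate` are made-up names, and "untainted" is modelled as "wrapped in `TrustedHtml`". The translator-supplied template is escaped first, then each `%{name}` slot is filled with a value that is escaped unless it is trusted.

```ruby
require "cgi"

# Hypothetical wrapper marking dev-authored (untainted) markup.
TrustedHtml = Struct.new(:html)

# Trusted markup passes through; anything else is escaped.
def render(value)
  value.is_a?(TrustedHtml) ? value.html : CGI.escapeHTML(value.to_s)
end

# Escape the translator-supplied template, then fill each %{name} slot.
def interpolate(template, **values)
  html = CGI.escapeHTML(template).gsub(/%\{(\w+)\}/) do
    render(values.fetch(Regexp.last_match(1).to_sym))
  end
  TrustedHtml.new(html)
end

template = "<b>Call</b> to action: %{click_here}"   # HTML smuggled into the translation
trusted  = TrustedHtml.new('<a href="/buy">Buy</a>') # e.g. the output of a dev-written helper
tainted  = '<a href="/buy">Buy</a>'                  # e.g. user input

puts interpolate(template, click_here: trusted).html
# prints: &lt;b&gt;Call&lt;/b&gt; to action: <a href="/buy">Buy</a>
puts interpolate(template, click_here: tainted).html
# prints: &lt;b&gt;Call&lt;/b&gt; to action: &lt;a href=&quot;/buy&quot;&gt;Buy&lt;/a&gt;
```

In both cases the `<b>` from the translation comes out escaped; only the value bound to a trusted instance keeps its markup.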

> how you differentiate between potential HTML in the translation

It's very simple: the translation is not trusted, so any HTML in it is fully escaped on rendering and ends up looking like garbage. If you really wanted to style something in the middle of a paragraph (which, again, effectively never comes up in my experience), you would have to split the paragraph into 3 keys: everything leading up to the start of the tag, whatever's inside the tag, and everything after the tag.
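As an illustration of the three-key split, here is a hedged Ruby sketch. The key names, the example sentence and the `styled_intro` helper are invented for this example; the only point is that the `<em>` tag lives in developer code while each translated piece is escaped.

```ruby
require "cgi"

# Hypothetical translation table with one paragraph split into three keys.
TRANSLATIONS = {
  "intro.before"   => "Read the ",
  "intro.emphasis" => "terms",
  "intro.after"    => " before you continue."
}.freeze

# Every translated piece is escaped on lookup in this sketch.
def t(key)
  CGI.escapeHTML(TRANSLATIONS.fetch(key))
end

# The markup itself is written by the developer, not the translator.
def styled_intro
  "#{t('intro.before')}<em>#{t('intro.emphasis')}</em>#{t('intro.after')}"
end

puts styled_intro
# prints: Read the <em>terms</em> before you continue.
```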