Hacker News
by msbarnett 1452 days ago
It's a neat idea but by intermixing code, presentation, and data you're going to run into a bunch of issues that the "traditional" approach avoids.

For one thing, we get our translations by handing a YAML file to external contractors. They don't need to squint at a file full of code to distinguish the bits of English that need translating from the bits that don't – they just have to translate the right side of every key, and there's specialized tooling to help them with this.

And for another, even in your toy example in the readme you've now lost a Single Source of Truth for certain presentation decisions. So now when some stakeholder comes to you and says they hate the italicization in the intro paragraph and to lose it ASAP, instead of taking the markup out of a common template that different data gets inserted into, you have to edit each language's version of the code to remove the markup (with all of the attendant ease of making errors that comes along when you lack a SPOT – easy to miss one language, etc). I'd expect these kinds of multiplication-of-edit problems to grow increasingly complex when you scale this approach beyond toy examples.

Basically this seems really hard to scale to large products, and doesn't play well with division of labour.

2 comments

So let me just first say that just sending a CSV/JSON/YAML/whatever file to professional translators and expecting good results back is just not going to work. We've done that and sometimes the context is just horribly wrong. The only way to get good results is for the translators to actually see the UI or, even better, run the app themselves.

But I'm interested to hear how you would solve the presentation issues you mention. I absolutely think the right way is to have translations be HTML fragments. How else would you know what part of the sentence should be italic or contain a hyperlink?

> So let me just first say that just sending a CSV/JSON/YAML/whatever file to professional translators and expecting good results back is just not going to work. We've done that and sometimes the context is just horribly wrong. The only way to get good results is for the translators to actually see the UI or, even better, run the app themselves.

You give them some context, and let them ask you questions if they feel things are too ambiguous for them to produce an accurate translation for the context it will be used in. In some cases we will include a screenshot of the rendered English page/component/etc so that the translator can map the key values they're seeing to the presentation context.

I can only tell you that this process has scaled to 10s of millions in sales in foreign languages, and that the translation services we use absolutely do not have any time or interest in signing additional NDAs around source code, in getting their employees set up with bespoke code and dev environments, etc. It would be a gigantic drag on their business model.

> I absolutely think the right way is to have translations be HTML fragments.

These translators do not know HTML and are not going to be able to work with it in any way – again, this would require the services to totally overhaul their business model, and spend a bunch of money/time on training or hiring more specialized translators with HTML/CSS skills, which they have no interest in doing.

It would also open up a threat model that's currently non-existent for us. Total non-starter.

> How else would you know what part of the sentence should be italic or contain a hyperlink?

Translation keys contain a simple substitutional form that can be replaced on key lookup, so

     some.introductory.paragraph: Call to action: %{click_here}
     a.fancy.link.name: Click to purchase!
in code:

     t('.some.introductory.paragraph', click_here: link(target_url, t('.a.fancy.link.name')))
The developer can inject formatting that way if necessary, etc, although generally speaking this is a really rare use-case in my experience: randomly italicizing or bolding or otherwise styling words in a paragraph looks fairly unprofessional/isn't typically done.
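
A minimal sketch of this kind of placeholder substitution, in plain Ruby rather than any particular i18n library. The `TRANSLATIONS` hash and the `t` and `link` helpers are hypothetical stand-ins, the keys are written in full instead of with the leading-dot shorthand, and no escaping is done yet:

```ruby
# Hypothetical in-memory stand-in for a parsed YAML translation file.
TRANSLATIONS = {
  "some.introductory.paragraph" => "Call to action: %{click_here}",
  "a.fancy.link.name"           => "Click to purchase!"
}.freeze

# Look up a key and fill in its %{name} placeholders.
# Kernel#format resolves %{name} references from a hash with symbol keys.
def t(key, **vars)
  template = TRANSLATIONS.fetch(key)
  vars.empty? ? template : format(template, vars)
end

# Hypothetical helper that wraps text in an anchor tag (no escaping here).
def link(url, text)
  %(<a href="#{url}">#{text}</a>)
end

puts t("some.introductory.paragraph",
       click_here: link("https://example.com/buy", t("a.fancy.link.name")))
# prints: Call to action: <a href="https://example.com/buy">Click to purchase!</a>
```

The translator only ever sees the two strings in the hash; the markup lives in `link`, on the developer's side.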

My experience is that most translators actually do know basic HTML or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I'd argue that one reason rich text translations are rare is because it's such a pain. Just look at any static documentation web site -- styling and links are everywhere. Of course I want that for non-static web sites/apps as well; links to navigate the app, side-bars or popovers with help text and documentation, more links, bulleted lists ...

I'm not sure I understand how your example prevents that HTML threat model you mention, unless the "link" function generates some kinds of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).

> My experience is that most translators actually do know basic HTML or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I don't know what to tell you other than that it is not my experience at all that translation services offer or even accept HTML as a source-format, and if they did they would no doubt command a significant premium over translators who know the languages but lack such tech skills.

And I absolutely wouldn't trust a third party to directly author HTML we were serving anyways. Manual audits of 3rd-party input aren't enough – your tooling should be automatically protecting you from 3rd parties inserting unsanitized HTML (as below)

> I'm not sure I understand how your example prevents that HTML threat model you mention, unless the "link" function generates some kinds of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway

Good lord, no – you should never be rendering externally controlled strings directly as unescaped HTML, and that includes strings from 3rd party translators.

The lookup function for translation keys produces instances of an "unsanitized" (tainted) string class which is escaped on rendering. So if "link" in this case takes two arguments (the URL, which will become the href, and the text that will get wrapped in the A tag), the text argument will be completely escaped, such that attempting to embed HTML in the translation key .a.fancy.link.name would result in mangled output, e.g.:

translation file:

    a.fancy.link.name: <script src="some-evil-bitcoin-miner-script.js"></script>Click Here!
HTML template:

    <%= link(foo_service_url, t('.a.fancy.link.name')) %>
would produce the final HTML:

    <a href="http://www.foo.com">&lt;script src=&quot;some-evil-bitcoin-miner-script.js&quot;&gt;&lt;/script&gt;Click Here!</a>
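
A rough sketch of how escape-on-render could behave, using only Ruby's standard library. The `TrustedHTML` class and the `render` and `link` helpers are hypothetical names for illustration, not the API of the system described here; the one real library call is `ERB::Util.html_escape`:

```ruby
require "erb"

# Hypothetical marker class: only dev-authored markup is ever wrapped in it.
class TrustedHTML
  def initialize(html)
    @html = html
  end

  def to_s
    @html
  end
end

# Anything that is not TrustedHTML is treated as tainted and escaped.
def render(value)
  value.is_a?(TrustedHTML) ? value.to_s : ERB::Util.html_escape(value.to_s)
end

# Dev-controlled helper: the result is trusted, but its text argument is
# escaped unless it is itself trusted. The URL is assumed to be dev-supplied.
def link(url, text)
  TrustedHTML.new(%(<a href="#{ERB::Util.html_escape(url)}">#{render(text)}</a>))
end

evil = %(<script src="some-evil-bitcoin-miner-script.js"></script>Click Here!)
puts render(link("http://www.foo.com", evil))
# prints: <a href="http://www.foo.com">&lt;script src=&quot;some-evil-bitcoin-miner-script.js&quot;&gt;&lt;/script&gt;Click Here!</a>
```

The anchor tag survives because it came from dev-authored code; the script tag does not because it came from the translation.
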
> (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).

That's... hardly the only threat you face if you have a translator feeding you malicious strings.

We're not talking full HTML documents here, just strings with an occasional link or word styling or maybe once in a while a bulleted list. But your experience differs from mine then.

Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

> Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

Well, the differentiation is between strings which are dev-authored and parsed during compilation/boot time – which are trusted (untainted) and thus may contain HTML that's rendered directly – and tainted strings which come in at runtime either from user input or via the translations lookup (among other things), and which can never be rendered without fully HTML escaping (without the code explicitly untainting them, at least, but that would never survive code review because it's profoundly unsafe to do this).

click_here isn't a "real" variable in the source language, it's just something that the translation API can replace during the translation load. To the extent that it can contain HTML, it can do so if and only if it is bound to an untainted string instance during the translation load – binding it to a tainted instance would cause any HTML that gets inserted into there to get fully escaped. "link" being dev-controlled produces untainted strings, but might itself consume a tainted string for its title (and thus escape that while rendering the title as part of outputting its untainted string), etc.
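
One way to picture the binding rule, as a sketch under assumptions: `TrustedHTML`, `render`, and `interpolate` are hypothetical names, and the real system's internals are not shown in this thread. The translation text is always escaped, and each bound value is escaped unless it is a trusted instance:

```ruby
require "erb"

# Hypothetical marker for dev-authored (untainted) markup.
class TrustedHTML
  def initialize(html)
    @html = html
  end

  def to_s
    @html
  end
end

# Tainted values are escaped; trusted ones pass through unchanged.
def render(value)
  value.is_a?(TrustedHTML) ? value.to_s : ERB::Util.html_escape(value.to_s)
end

# Escape the translation text first, then fill each %{name} placeholder.
# Placeholder syntax contains no HTML-special characters, so it survives.
def interpolate(template, **vars)
  html = ERB::Util.html_escape(template).gsub(/%\{(\w+)\}/) {
    render(vars.fetch(Regexp.last_match(1).to_sym))
  }
  TrustedHTML.new(html)
end

trusted_link = TrustedHTML.new('<a href="/buy">Click to purchase!</a>')
user_input   = "<b>pwned</b>"

puts interpolate("Call to action: %{click_here}", click_here: trusted_link)
# prints: Call to action: <a href="/buy">Click to purchase!</a>
puts interpolate("Call to action: %{click_here}", click_here: user_input)
# prints: Call to action: &lt;b&gt;pwned&lt;/b&gt;
puts interpolate("<i>Call</i> to action: %{click_here}", click_here: trusted_link)
# prints: &lt;i&gt;Call&lt;/i&gt; to action: <a href="/buy">Click to purchase!</a>
```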

> how you differentiate between potential HTML in the translation

It's very simple: the translation is not trusted and thus can't contain HTML that gets rendered without being fully escaped, and thus looking like garbage. If you really wanted to style something in the middle of the paragraph (which again, effectively never really comes up in my experience) you would have to split the paragraph into 3 keys: everything leading up to the start of a tag, whatever's inside the tag, and everything after the tag.
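
The three-key split might look roughly like this. The key names and the sentence are made up for illustration, and `t` here is a hypothetical lookup that escapes everything it returns:

```ruby
require "erb"

# Hypothetical translations: one sentence split around the styled span.
TRANSLATIONS = {
  "intro.before" => "Our plans are ",
  "intro.styled" => "always",
  "intro.after"  => " free to cancel."
}.freeze

# Every looked-up string is treated as tainted and escaped.
def t(key)
  ERB::Util.html_escape(TRANSLATIONS.fetch(key))
end

# The only markup in the output is the dev-authored <p> and <em> tags.
puts "<p>#{t('intro.before')}<em>#{t('intro.styled')}</em>#{t('intro.after')}</p>"
# prints: <p>Our plans are <em>always</em> free to cancel.</p>
```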

> Single Source of Truth for certain presentation decisions.

You can't have a single source of truth for presentation decisions in a multilingual product. Different languages have different typographic traditions, will demand different minimum container sizes based on word lengths and maybe this is shocking but they sometimes run in different directions. If you are not integrating the dev, design and localized copy editing roles on your team, your product is going to look like trash except where the primary language of the team is concerned.

Translation can scale for large products, but localization cannot: until further notice, you can only do it the hard way, or the wrong way.

> You can't have a single source of truth for presentation decisions in a multilingual product. Different languages have different typographic traditions, will demand different minimum container sizes based on word lengths and maybe this is shocking but they sometimes run in different directions.

Maybe this is shocking but I'm fluent in a language that is sometimes written vertically.

"You can't have one single common presentation for every translation" is true in an absolute sense but often not true in practice – eg) we hit most of Europe and North, Central, and South America with ~10 static translations rendered into one common presentational template, none of which run into any of the truly complex layout differences that right-to-left or vertical presentations would bring. We extensively QA all of the languages we do support, and presentation issues are truly pretty damn rare. It's your classic "80% of the result for 20% of the effort" tradeoff.

Now, if you truly do need to localize in every language under the sun then yeah, something like this can make sense, as it gives you maximum flexibility with respect to varying your layout alongside the translation.

But if you have any simpler use-case (eg. supporting just English, Spanish, French and Portuguese will give you an enormous chunk of the planet with minimal overhead, as they have very similar word lengths and presentation requirements) then the approach here is just taking on all of the effort and maintenance overhead of the maximally-complex case when you have absolutely no need to.