by LeviticusMB 1442 days ago
My experience is that most translators actually do know basic HTML, or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I'd argue that one reason rich text translations are rare is that they're such a pain. Just look at any static documentation web site -- styling and links are everywhere. Of course I want that for non-static web sites/apps as well: links to navigate the app, side-bars or popovers with help text and documentation, more links, bulleted lists ...

I'm not sure I understand how your example prevents the HTML threat model you mention, unless the "link" function generates some kind of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).


> My experience is that most translators actually do know basic HTML, or can at least translate an English base string containing HTML into their own language without messing it up. CSS would of course not be present, just semantic HTML (or any other kind of "rich text" -- it wouldn't have to be HTML specifically).

I don't know what to tell you other than that it is not my experience at all that translation services offer or even accept HTML as a source format, and if they did, they would no doubt command a significant premium over translators who know the languages but lack such tech skills.

And I absolutely wouldn't trust a third party to directly author HTML we were serving anyway. Manual audits of 3rd-party input aren't enough – your tooling should be automatically protecting you from 3rd parties inserting unsanitized HTML (as below).

> I'm not sure I understand how your example prevents the HTML threat model you mention, unless the "link" function generates some kind of magic placeholders that you then replace with HTML in another step you did not mention. If "link" generates an A tag, then you're already trusting the translation with HTML powers anyway

Good lord, no – you should never be rendering externally controlled strings directly as unescaped HTML, and that includes strings from 3rd party translators.

The lookup function for translation keys produces instances of an "unsanitized" (tainted) string class, which is escaped on rendering. So if "link" in this case takes two arguments (the URL, which will become the href, and the text that will get wrapped in the A tag), the text argument will be completely escaped, such that attempting to embed HTML in the translation key .a.fancy.link.name would result in mangled output. For example:

translation file:

    a.fancy.link.name: <script src="some-evil-bitcoin-miner-script.js"></script>Click Here!

HTML template:

    <%= link(foo_service_url, t(.a.fancy.link.name)) %>

would produce the final HTML:

    <a href="http://www.foo.com">&lt;script src=&quot;some-evil-bitcoin-miner-script.js&quot;&gt;&lt;/script&gt;Click Here!</a>

> (not that I find that much of a problem -- at least not with my approach where XSS via params is not possible).

That's... hardly the only threat you face if you have a translator feeding you malicious strings.
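For readers who want to see the mechanism, the tainted-string behaviour in the example above can be sketched roughly like this. This is only an illustration in Python, not the commenter's actual framework: the `Tainted` and `Safe` classes, `render`, and the in-memory `TRANSLATIONS` table are all hypothetical names, and `link` and `t` are merely modelled on the template shown above.

```python
from html import escape


class Tainted(str):
    """Text from an untrusted source (translations, user input)."""


class Safe(str):
    """Markup produced by developer-controlled code; rendered as-is."""


def render(value: str) -> str:
    # Only Safe values pass through untouched; everything else is escaped.
    return str(value) if isinstance(value, Safe) else escape(str(value), quote=True)


# Hypothetical stand-in for the translation file.
TRANSLATIONS = {
    "a.fancy.link.name": '<script src="some-evil-bitcoin-miner-script.js"></script>Click Here!',
}


def t(key: str) -> Tainted:
    # Translation lookups always come back tainted.
    return Tainted(TRANSLATIONS[key])


def link(url: str, text: str) -> Safe:
    # The helper is dev-authored, so its output is Safe, but both
    # arguments go through render() and are escaped unless they are Safe.
    return Safe(f'<a href="{render(url)}">{render(text)}</a>')


html = render(link("http://www.foo.com", t("a.fancy.link.name")))
print(html)
```

The script tag from the translation comes out as inert `&lt;script ...&gt;` text inside the link. Note that escaping alone does not validate the URL scheme, so a real helper would presumably need a separate check on the href.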

We're not talking full HTML documents here, just strings with an occasional link or word styling or maybe once in a while a bulleted list. But your experience differs from mine then.

Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

> Regarding the link, I was more thinking about how your system handles the some.introductory.paragraph translation and how you differentiate between potential HTML in the translation vs HTML in the click_here variable vs HTML in another potential variable containing user input.

Well, the differentiation is between two kinds of strings. Strings which are dev-authored and parsed at compilation/boot time are trusted (untainted) and thus may contain HTML that's rendered directly. Tainted strings come in at runtime, either from user input or via the translations lookup (among other things), and can never be rendered without full HTML escaping. The only way around that is for the code to explicitly untaint them, and that would never survive code review because it's profoundly unsafe.

click_here isn't a "real" variable in the source language, it's just something that the translation API can replace during the translation load. To the extent that it can contain HTML, it can do so if and only if it is bound to an untainted string instance during the translation load – binding it to a tainted instance would cause any HTML inserted there to get fully escaped. "link", being dev-controlled, produces untainted strings, but might itself consume a tainted string for its title (and thus escape that while rendering the title as part of outputting its untainted string), etc.
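A rough sketch of that binding rule, again in Python and again purely illustrative: the `{name}` placeholder syntax, the `interpolate` function, and the `Safe` class are assumptions made for the example, not the API being described. The translator's text is always escaped, while a bound value is emitted raw only if it is an untainted (`Safe`) instance.

```python
from html import escape
import re


class Safe(str):
    """Dev-authored markup that may be rendered without escaping."""


def render(value: str) -> str:
    return str(value) if isinstance(value, Safe) else escape(str(value))


def interpolate(translation: str, bindings: dict) -> Safe:
    # re.split with a capture group returns literal text at even indices
    # and the captured placeholder names at odd indices.
    parts = re.split(r"\{(\w+)\}", translation)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            out.append(escape(part))            # translator text: always escaped
        else:
            out.append(render(bindings[part]))  # bound value: raw only if Safe
    return Safe("".join(out))


paragraph = "Please {click_here}, <b>{user}</b>!"   # pretend this came from a translator
trusted_link = Safe('<a href="/next">click here</a>')  # built by dev-controlled code
result = interpolate(paragraph, {"click_here": trusted_link, "user": "<i>Bob</i>"})
print(result)
```

Here the dev-built link survives as real markup, while both the `<b>` the translator tried to add and the `<i>` in the user-supplied name are escaped.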

> how you differentiate between potential HTML in the translation

It's very simple: the translation is not trusted, so any HTML it contains is fully escaped before rendering and ends up looking like garbage. If you really wanted to style something in the middle of the paragraph (which again, effectively never really comes up in my experience) you would have to split the paragraph into 3 keys: everything leading up to the start of a tag, whatever's inside the tag, and everything after the tag.
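The three-key split might look something like the following. This is a simplified Python sketch under stated assumptions: the key names and sample strings are invented, and `t` here escapes at lookup time rather than returning a tainted string object, just to keep the example short.

```python
from html import escape

# Hypothetical translation entries: one paragraph split around a styled span.
TRANSLATIONS = {
    "intro.before": "Your changes are ",
    "intro.emphasis": "not <saved>",
    "intro.after": " until you press Save.",
}


def t(key: str) -> str:
    # Every lookup result is escaped before it can reach the page.
    return escape(TRANSLATIONS[key])


# The tag itself lives in the dev-authored template, never in a translation.
paragraph = f'<p>{t("intro.before")}<strong>{t("intro.emphasis")}</strong>{t("intro.after")}</p>'
print(paragraph)
```

The `<strong>` tag is rendered because it comes from the template, while the `<saved>` inside the translated text is escaped. The obvious cost is that the translator sees three fragments instead of one sentence.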