by jhgb 1689 days ago
> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

That actually strikes me as very desirable. (Especially in light of the old maxim that "programs must be written for people to read, and only incidentally for machines to execute".)

Those Unicode characters aren't just there for show. They're part of real scripts that real people use; highlighting every one of them would be annoying for people who write in those scripts.
I'm fairly sure this could be arranged. As in, if there are many such characters and they all belong to the character set of a particular language, then it's very likely that it's simply text in that language. But random characters in the middle of ASCII identifiers are probably not something that you want.
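A rough sketch of that heuristic, in Python. The names `rough_script` and `is_mixed_script` are made up for illustration, and guessing a script from the first word of the Unicode character name is only an approximation (the standard library does not expose the real Script property), so treat this as a toy rather than what any editor actually does:

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Approximate the script from the first word of the Unicode name.

    Assumption: names like "LATIN SMALL LETTER A" or
    "CYRILLIC SMALL LETTER ES" start with the script name.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def is_mixed_script(identifier: str) -> bool:
    """True if the letters of an identifier come from more than one script."""
    scripts = {rough_script(ch) for ch in identifier if ch.isalpha()}
    return len(scripts) > 1

# "access" with its two c's replaced by Cyrillic "es" (U+0441)
print(is_mixed_script("a\u0441\u0441ess"))  # True
# An identifier written entirely in Cyrillic is left alone
print(is_mixed_script("\u0434\u043e\u0441\u0442\u0443\u043f"))  # False
```

An identifier written wholly in one script passes, so people writing in their own language are not flagged; only the mixed case is.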
Yeah I'm not opposed to adding highlighting to them, and we are investigating how to do it, but it was less clear-cut than the bidi characters (which are totally invisible when rendered). I think we'll want to make it a bit more configurable and probably a separate option to the one which highlights the bidi characters.
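For comparison, the bidi case really is more clear-cut, since the set of characters is small and fixed. A minimal sketch in Python, assuming the characters of interest are the explicit embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069); `find_bidi_controls` is a hypothetical helper, not any editor's actual implementation:

```python
# Explicit bidi embedding/override controls and isolate controls.
BIDI_CONTROLS = {chr(c) for c in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, codepoint) for each bidi control, both 1-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

# A right-to-left override hidden inside a string literal
print(find_bidi_controls('x = "a\u202eb"\ny = 1'))  # [(1, 7, 'U+202E')]
```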
Exactly. When we were adding support for non-ASCII identifiers to Rust, and thinking about homoglyphs and confusable characters, we needed to evaluate the tradeoffs between catching such characters and inconveniencing the speakers of various languages who want to write Rust in their language.
This type of attack isn't new. I can't recall the names, but as far as I remember there are multiple C/C++ coding standards that limit everything to ASCII to avoid precisely this attack, as well as others that rely on visually similar but nonequivalent names.
Yes, and they should be in well annotated/marked string/data sections, not in logic code.
Latin C and Cyrillic С aren't the same letter. The latter is actually an "s". It would be a pain in the ass to work with strings if those Cyrillic letters that look like their Latin counterparts reused their codepoints. Imagine having to convert "M" to lowercase. Would that return "m" or "м"? Same for "H": would it be "h" or "н"?

And, actually, there was some really really cursed Soviet encoding that did this to save bits. The Russian railway company still uses it[1] to this day.

[1] https://habr.com/ru/post/547820/
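The lowercase point is easy to check in Python, since the lookalike letters have separate codepoints and each lowercases within its own script. `describe` is just a helper name made up for this snippet:

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_m, cyrillic_em = "M", "\u041c"  # look the same in most fonts

print(latin_m == cyrillic_em)        # False
print(describe(latin_m.lower()))     # U+006D LATIN SMALL LETTER M
print(describe(cyrillic_em.lower())) # U+043C CYRILLIC SMALL LETTER EM
print(describe("\u0421"))            # U+0421 CYRILLIC CAPITAL LETTER ES
```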

> there was some really really cursed Soviet encoding

I know at least 10 stories that start like this

> Latin C and Cyrillic С aren't the same letter.

Well, as a moderately old Czech, I'm somewhat familiar with Cyrillic. They kind of used to force it on us in schools.