| HN Mirror



	by jhgb 1689 days ago
	> It'd also likely be pretty frustrating to end users if we were to highlight every single Unicode character that looks like the Latin alphabet.

	That actually strikes me as very desirable. (Especially in light of the old maxim that "programs must be written for people to read, and only incidentally for machines to execute".)


wizzwizz4 1689 days ago

Those Unicode characters aren't just there for show. They're part of real scripts that real people use; it would be annoying for people using those scripts.


jhgb 1689 days ago

I'm fairly sure this could be arranged. As in, if there are many of them and they all belong to the character set of one particular language, then it's very likely that it's simply text in that language. But stray characters in the middle of otherwise ASCII identifiers are probably not something that you want.
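
A minimal sketch of that heuristic, assuming the first word of a character's Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) is an acceptable stand-in for its script. Python's standard library does not expose the Unicode Script property, so this is only an approximation, and the helper names `rough_script` and `is_suspicious_identifier` are hypothetical:

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is a rough proxy, not the real Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_suspicious_identifier(ident: str) -> bool:
    """Flag identifiers whose letters come from more than one (rough) script."""
    scripts = {rough_script(ch) for ch in ident if ch.isalpha()}
    return len(scripts) > 1


print(is_suspicious_identifier("paypal"))        # False: all Latin
print(is_suspicious_identifier("p\u0430ypal"))   # True: contains Cyrillic U+0430
print(is_suspicious_identifier("\u043f\u0440\u0438\u0432\u0435\u0442"))  # False: all Cyrillic
```

An identifier written entirely in one script passes, which matches the idea that such text is probably just written in that language. Scripts that legitimately mix name prefixes (for example Japanese kana alongside CJK ideographs) would be flagged by this sketch, so a real tool would need a proper script table.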


robotmay 1689 days ago

Yeah, I'm not opposed to adding highlighting to them, and we are investigating how to do it, but it was less clear-cut than the bidi characters (which are totally invisible when rendered). I think we'll want to make it a bit more configurable, and probably a separate option from the one which highlights the bidi characters.
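
For reference, finding the bidi characters really is the clear-cut case, because they form a small fixed set of code points. The following is a rough sketch rather than the implementation discussed above: it covers the explicit embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069), and the names `BIDI_CONTROLS` and `find_bidi_controls` are made up for illustration:

```python
# Explicit bidirectional formatting characters and their standard abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202d": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e": "RLO",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066": "LRI",  # LEFT-TO-RIGHT ISOLATE
    "\u2067": "RLI",  # RIGHT-TO-LEFT ISOLATE
    "\u2068": "FSI",  # FIRST STRONG ISOLATE
    "\u2069": "PDI",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, abbreviation) for every bidi control character in text."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


line = "if access_level != 'user\u202e \u2066// check admin\u2069 \u2066' {"
print(find_bidi_controls(line))
```

A highlighter could use the returned indices to render each control as a visible marker instead of leaving it invisible.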


JoshTriplett 1689 days ago

Exactly. When we were adding support for non-ASCII identifiers to Rust, and thinking about homoglyphs and confusable characters, we needed to evaluate the tradeoffs between catching such characters and inconveniencing the speakers of various languages who want to write Rust in their language.


R0b0t1 1689 days ago

This type of attack isn't new. I can't recall the names, but AFAIR there are multiple C/C++ coding standards that limit everything to ASCII to avoid precisely this attack, as well as others that rely on visually similar but nonequivalent names.


pas 1689 days ago

Yes, and they should be in well annotated/marked string/data sections, not in logic code.


grishka 1689 days ago

Latin C and Cyrillic С aren't the same letter. The latter is actually an "s". It would be a pain in the ass to work with strings if those Cyrillic letters that look like their Latin counterparts reused their codepoints. Imagine having to convert "M" to lowercase. Would that return "m" or "м"? Same for "H", "h" or "н"?
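
The distinction can be checked directly. A small sketch using Python's standard `unicodedata` module (the helper `describe` is just an illustrative name) shows that the look-alike letters have separate code points and that lowercasing is unambiguous because of it:

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


latin_c = "C"
cyrillic_es = "\u0421"  # looks like Latin C, but is the Cyrillic letter "es"

print(describe(latin_c))        # U+0043 LATIN CAPITAL LETTER C
print(describe(cyrillic_es))    # U+0421 CYRILLIC CAPITAL LETTER ES
print(latin_c == cyrillic_es)   # False

# Each capital maps to the lowercase letter of its own script.
print(describe("M".lower()))        # U+006D LATIN SMALL LETTER M
print(describe("\u041c".lower()))   # U+043C CYRILLIC SMALL LETTER EM
print(describe("H".lower()))        # U+0068 LATIN SMALL LETTER H
print(describe("\u041d".lower()))   # U+043D CYRILLIC SMALL LETTER EN
```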

And, actually, there was some really really cursed Soviet encoding that did this to save bits. The Russian railway company still uses it[1] to this day.

[1] https://habr.com/ru/post/547820/


gambas99 1689 days ago

> there was some really really cursed Soviet encoding

I know at least 10 stories that start like this


jhgb 1689 days ago

> Latin C and Cyrillic С aren't the same letter.

Well, as a moderately old Czech, I'm somewhat familiar with Cyrillic. They kind of used to force it on us in schools.
