Watch out: ɢoogle.com isn’t the same as Google.com


	Watch out: ɢoogle.com isn’t the same as Google.com (thenextweb.com)
	206 points by lucodibidil 3499 days ago

23 comments

rurban 3498 days ago

What about ‮goog‬le.com which is really <U+202E>goog<U+202C>le.com :)

TR36 bidi spoofs are usually worse than TR39 confusables. Move your cursor over it. http://www.unicode.org/reports/tr36/#Bidirectional_Text_Spoo...

That's why browsers and DNS tools use libidn; programming languages just don't.
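A minimal Python sketch of how such bidi controls could be spotted in a string. The helper name `find_bidi_controls` is made up for illustration; it relies only on the standard `unicodedata` module and is not how libidn does it.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, code point, bidi class) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

# The spoofed name from above, written with escapes so the controls are visible.
spoofed = "\u202egoog\u202cle.com"
print(find_bidi_controls(spoofed))
```

Any non-empty result is a reason to distrust the string as a hostname.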


a3n 3498 days ago

This is strange to me. This is clearly meant, in unicode, to be 'G' that we all know and love. It has uselessly expanded "the alphabet" (to be western-centric) in a confusable way.

Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.


hackuser 3498 days ago

> Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.

It brings up interesting, long-standing problems. Which of these count as the same letters?

* Letters in two languages with the same appearance and making the same phonetic sound

* Letters in two languages with the same appearance but making slightly different phonetic sounds. E.g., R in English and French

* Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

* Letters in two languages with the same appearance but making completely different phonetic sounds.

* Similar (by any property) letters in two related languages; e.g., both Indo-European

* Similar (by any property) letters in two unrelated languages; e.g., French and Vietnamese.

* Letters with the same phonetic sound but different appearances.

* Letters with the same appearance, one is phonetic and one an ideograph

* Letters that are otherwise identical, but alphabetize differently in their respective languages

* EDIT: Forgot a key one; Letters that are otherwise identical, but follow different rules of how they combine with the letters around them (a common issue, though not familiar to English speakers).

* Letters that are in all ways identical but belong in different languages. In which language's code group does the letter belong? One? Both? What if the subset of Unicode supported by an application includes one language but not the other?

etc. etc.


vurpo 3498 days ago

> Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

It gets worse than this. Example: the letters Ä and Ö exist in both Swedish and German (as an example).

In German they are actually counted as the letters A and O with diaereses above them, and they alphabetize together with other instances of the letters A and O, because that's what they are.

In Swedish those are their own letters, which are completely separate from the letters A and O. They get their own place in the alphabet (second-to-last and last, respectively), and replacing them with AE and OE is technically not acceptable in Swedish like it is in German (though it's often done anyway, e.g. on airline tickets).

And in Unicode they are represented by the same code-point even though in one language it is a letter, and in the other language it's only a variation on another letter. What a mess.


jahewson 3498 days ago

That character is from the phonetic alphabet so it's not the "concept of G", it's the concept of a "voiced uvular stop", which happens to look visually like G. So what Unicode is doing is separating two conceptually different ideas, exactly as intended.

The cases where Unicode has taken similar looking characters and combined them into one have not been successful, Han Unification for example was widely viewed as a misstep and has caused many problems, such as making it impossible to embed certain Japanese characters in Chinese text without higher-level markup.


stevenbedrick 3498 days ago

It actually does do something along those lines, with the "canonical" and "compatible" equivalence rules:

https://en.wikipedia.org/wiki/Unicode_equivalence

As mentioned by others on this thread, the real issue is not with Unicode per se, but rather with the ways that web browsers handle it (or fail to handle it, as the case may be).
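A small Python illustration of the two equivalence kinds, using the standard `unicodedata` module. The three sample characters are my own picks, not taken from the linked article.

```python
import unicodedata

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
fi_ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# Canonical equivalence: NFC already maps the Angstrom sign to A-with-ring.
print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)

# Compatibility equivalence: NFC leaves the ligature alone,
# only NFKC/NFKD unfold it into the two letters "fi".
print(unicodedata.normalize("NFC", fi_ligature) == "fi")
print(unicodedata.normalize("NFKC", fi_ligature) == "fi")
```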


zokier 3498 days ago

I think it is very much an issue in Unicode that they did not define the NFKD of ɢ to be G. As far as I can tell, the rationale is that ɢ is semantically different because it is used in IPA. I find that pretty weak, considering the ubiquity of smallcaps. Asking browsers to diverge (as far as equivalence goes) from Unicode standards sounds a lot like a failure of Unicode.


spullara 3498 days ago

The web browser or DNS?


drewmate 3498 days ago

That's a really interesting proposal, but I'm afraid it would be difficult to implement in practice. If this third dimension were actually encoded into the number that represents each character, you'd end up with a lot of wasted bits (since most characters probably wouldn't even need the 3rd dimension, or at least as much of it as the heaviest users.) Another option would be to supplement the metadata that already accompanies Unicode characters (which block it is in, the name of the character/block, etc...) This could be done in practice now, but the information would almost certainly just be ignored if it needed to be looked up in a supplemental table. Furthermore, it's difficult to agree on just about anything in Unicode, and classifying all the characters based on concept seems like a Herculean task for a slow-moving body.

Any ideas for how to accomplish this in practice?


a3n 3498 days ago

I'll get to that as soon as I make email secure by design.


jahewson 3498 days ago

This already exists in Unicode: it's called "Variation Selectors". They have their own block and are used to select between text and emoji presentation of a character, amongst other things.

But it would be wrong to use them in this case because an IPA G and the letter G are semantically different things and should not be unified into a single character just because they look similar.


Lagged2Death 3498 days ago

The G is part of a block called "IPA extensions." Most of its content is more obviously specialized. This G is a phonetic G.

It's not necessarily the case that any given symbol has a bunch of different Unicode representations; unfortunately G has at least two, though.

https://en.m.wikipedia.org/wiki/IPA_Extensions

http://www.fileformat.info/info/unicode/block/ipa_extensions...


donquichotte 3499 days ago

Some time ago I registered http://www.goolge.io/. Still haven't done anything with it, I guess at some point I'll just redirect it to duckduckgo. [EDIT: now it's redirected to duckduckgo.]

This can of course be used in a malicious way. I thought about rebuilding the homepage of the bank Credit Suisse on www.credit-siusse.ch, but that's probably illegal.


ergot 3498 days ago

Most browsers should forcibly transcribe this to Punycode[1]:

    https://www.𝙿𝙰𝚈𝙿𝙰𝙻.com/

And yet when I paste this into the latest Firefox it redirects to https://www.paypal.com/

No 301 redirects or anything, the browser just treats it like ASCII, which it is clearly not, it actually happens to be Fullwidth:

https://en.wikipedia.org/wiki/Fullwidth_form

Serious phishing opportunity if you ask me!

[1] https://en.wikipedia.org/wiki/Punycode


bazzargh 3498 days ago

Nope. The browser is behaving sensibly, since you can't register that domain. It's applying the same rules that the registrars do.

ICANN require that registries follow RFC3491 and related RFCs for name prep before allowing a name to be registered https://www.icann.org/resources/unthemed-pages/idn-guideline... . What that one does is (among other things) NFKC normalization and case-folding:

    irb(main):016:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c"
    => "ＰＡＹＰＡＬ"
    irb(main):017:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c".unicode_normalize(:nfkc).downcase
    => "paypal"


2T1Qka0rEiPr 3498 days ago

Interesting. So, out of interest, why is the same not being applied for ɢ? (When I ran it through Python's unidecode I got the roman symbol all the same).


bazzargh 3498 days ago

Because 'small capital g' doesn't have a compatibility decomposition to G, but wide letter P does have a compatibility decomposition to 'normal' P. Unicode normalization kills large classes of homograph attacks but by no means all. Conventions over mixing scripts from different languages stop some more, but there's no single answer.
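The difference can be checked directly against the Unicode character database. A short Python sketch using the standard `unicodedata` module:

```python
import unicodedata

for ch in ("\u0262", "\uff30"):  # small capital G, fullwidth capital P
    # decomposition() returns an empty string when no mapping is defined.
    mapping = unicodedata.decomposition(ch) or "(none)"
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={mapping}, NFKC={folded!r}")
```

The small capital G comes out of NFKC unchanged, while the fullwidth P carries a `<wide>` mapping to plain P.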


7Z7 3498 days ago

Doing the "ɢ" conversion here[0], I get

  xn--1na

[0]https://www.punycoder.com/
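The same conversion can be reproduced offline with Python's built-in `punycode` codec. The helper `to_ace` is a made-up name, and this sketch skips the nameprep step a registrar would apply first.

```python
def to_ace(label):
    """Encode one Unicode label as an ASCII-compatible "xn--" label."""
    return "xn--" + label.encode("punycode").decode("ascii")

print(to_ace("\u0262"))        # the single character ɢ
print(to_ace("\u0262oogle"))   # the label from the article's title
```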


Animats 3498 days ago

The problem is that the RFCs aren't restrictive enough, partly because the IETF doesn't have much authority over registrars. The domain name rules really ought to be something like "one script, plus numbers, in a domain name part". But this runs into such things as the tendency in Japan to mix kanji with English words. Then there's the whole right-to-left mark business, which has to coexist with left-to-right TLDs.


ergot 3498 days ago

So if I mix ASCII with obscure UTF8 characters like the domain in OP's post I can register it then?

Something like www.ｐａｙpal.com --> www.n--pal-n76secrc.com


bazzargh 3498 days ago

No. When you apply NFKC normalization to that string, you get just 'paypal', so Paypal have already registered the result. You can try that here: http://mct.verisign-grs.com/ - notice how the output is not the same as some online converters based on punycode.js, because that doesn't have nameprep support https://github.com/bestiejs/punycode.js/issues/40
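A rough Python approximation of that mapping step, for anyone who wants to try it locally. `rough_nameprep` is a hypothetical helper: real nameprep (RFC 3491) also prohibits certain code points and applies bidi checks, which this sketch leaves out.

```python
import unicodedata

def rough_nameprep(label):
    """Approximate the mapping step: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", label).casefold()

mixed = "\uff50\uff41\uff59pal"   # fullwidth "pay" followed by ASCII "pal"
print(rough_nameprep(mixed))
print(rough_nameprep(mixed).isascii())
```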


arm 3498 days ago

Those characters are not fullwidth.

This:

www.ｐａｙｐａｌ.com

or this:

www.ＰＡＹＰＡＬ.com

would be fullwidth.

What you actually posted are characters in the Mathematical Alphanumeric Symbols block. Specifically:

𝙿 — U+1D67F MATHEMATICAL MONOSPACE CAPITAL P

𝙰 — U+1D670 MATHEMATICAL MONOSPACE CAPITAL A

𝚈 — U+1D688 MATHEMATICAL MONOSPACE CAPITAL Y

𝙻 — U+1D67B MATHEMATICAL MONOSPACE CAPITAL L
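The names can be confirmed with Python's standard `unicodedata` module. The string below is the pasted hostname label written out as escapes:

```python
import unicodedata

pasted = "\U0001D67F\U0001D670\U0001D688\U0001D67F\U0001D670\U0001D67B"

# Print the code point and official name of each character.
for ch in pasted:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# These characters carry compatibility mappings, so NFKC folds them to ASCII.
print(unicodedata.normalize("NFKC", pasted))
```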


thefreeman 3498 days ago

How is that a phishing opportunity if it redirects you to the real website?


pbhjpbhj 3498 days ago

That looks a lot like the sort of trademark use that authorities have deemed infringing, I'd expect the registrar to "recover" that unless you've got a clear explanation (like Goolge is your name, even then ... (remember Nissan, Mike Rowe, etc.)).


75j 3498 days ago

I registered http://www.4ppl3.com a while back. No potential for abuse really, but I just thought it was fun to have a l33t-speak version of the domain name of one of the world's most litigious companies.

That said, I haven't done anything with it, and I'm not a domain squatter, so if anyone wants it I can hook you up!


drzaiusapelord 3498 days ago

l33t-speak is so far off my radar that I was wondering if there was some hot startup called '4 people 3' or somesuch. I doubt anyone at Apple remotely cares about l33t-speak from a branding perspective.


erelde 3498 days ago

To be fair, I don't think I know anyone who "cares" about l33t speak.

H4xx0r j0k3s is all it is.

I've never seen anyone go out of their way to defend or use leet speak all day long.


Symbiote 3498 days ago

In some countries, it's the best that can be done for vanity car registration plates.

e.g. a site selling them in the UK is promoting "JO66 ERX", which is probably supposed to be read as "Jogger X". Current bid £750, for some reason.


Entangled 3499 days ago

Web browsers should have an option to show non-ascii chars in urls in red.
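A toy Python sketch of the idea, marking characters with brackets instead of color since plain text has none. The function name and the bracket markers are assumptions made for illustration, not any browser's behavior.

```python
def mark_non_ascii(host, open_mark="[", close_mark="]"):
    """Wrap each non-ASCII character so it stands out in plain text."""
    return "".join(
        ch if ch.isascii() else f"{open_mark}{ch}{close_mark}"
        for ch in host
    )

print(mark_non_ascii("\u0262oogle.com"))
```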


mcv 3498 days ago

This would be a great solution. Allowing unicode characters in domain names is just inviting trouble. I understand that people with non-Latin scripts want domain names in their own language and alphabet, but there are way too many unicode characters that will confuse people about legitimate-looking domain names.

Showing non-ascii in red would be an easy solution for everybody.


elihu 3498 days ago

It seems like a reasonable compromise would be to allow domain names in non-Latin languages as long as the entire name is in the character set of a specific language. So, if your name is in English, that's fine. If it's in, say, Cyrillic, that's fine too. But if you mix English and Cyrillic characters, that's not allowed. It wouldn't necessarily eliminate all name look-alikes, but it would get rid of most of them.
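A hedged Python sketch of such a single-script check. The standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's name; that heuristic, and the names `scripts_in` and `is_single_script`, are assumptions for illustration only.

```python
import unicodedata

def scripts_in(label):
    """Guess the scripts in a label from each letter's Unicode name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Heuristic: names look like "LATIN SMALL LETTER O" or
            # "CYRILLIC SMALL LETTER O", so the first word stands in
            # for the script.
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def is_single_script(label):
    """True if all letters in the label appear to share one script."""
    return len(scripts_in(label)) <= 1

print(is_single_script("google"))         # all Latin
print(is_single_script("go\u043egle"))    # Latin plus a Cyrillic о
```

Note that ɢ is itself named as a Latin letter, so a check like this would not flag ɢoogle.com.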


jcranmer 3498 days ago

That's supposed to be one of the rules at the registrar level, but it's one that gets ignored in practice.

I have heard proposals that mixed-script IDNs get converted to punycode in URL display, but I don't know if any browser has fully implemented that yet.


Klathmon 3498 days ago

Wouldn't a good compromise be to somehow highlight any characters that are outside of ASCII and the character set of the language you are using in the browser/os?


brainfire 3498 days ago

Everybody who isn't colorblind anyway.


entropy_ 3498 days ago

Use blue. The most common types of color blindness are red-green issues (there's a really tiny percentage that doesn't perceive colors at all, but it's really, really tiny; other than those, nobody has trouble with blue).

Source: I'm colorblind (protanope) and red would definitely be an issue. Android Studio, for example, is really annoying for me because the particular red they use for errors is very hard to distinguish from black.


zrm 3498 days ago

That kind of violates the intuitive "red/orange/yellow is alarm, blue/green/black is expected" notion though. What you could do is put ASCII characters in blue and non-ASCII characters in red.


jessriedel 3498 days ago

Isn't this a general argument against ever using red as a warning? Seems to prove too much.

Especially in this case, where there is unlikely to be a specialized class of scammers who go phishing only for people with red-green colorblindness. So long as browsers implement a feature that stops the phishing in 99% of cases, the scammers will try something else.


brainfire 3498 days ago

It's an argument against using red as the only warning sign.

Compare to Chrome's https indicator- it turns the "https://" part of the URL green (which I can barely distinguish as different, so it is useless to me) and adds a padlock icon.

Colorblind-friendly graphs might use both color and symbols to distinguish elements.


lb1lf 3498 days ago

Tritanopes may beg to differ (regarding the blue not being an issue, that is.)

Significantly less common than red/green deficiency, though - I only know of one more on the island I live on (pop. 15,000 or so)


entropy_ 3498 days ago

Wasn't aware of that one. Significantly fewer people affected though (I think red-green is something like 10% of males).


vbezhenar 3498 days ago

> people with non-Latin scripts want domain names in their own language

I've yet to see a useful site with a Cyrillic domain. Theoretically it sounds good, but practically everyone still uses Latin domains. Maybe it'll change with time.


kalleboo 3497 days ago

Same thing here in Japan with Kanji domains. Although I believe Windows XP usage here is still non-insignificant...


a3n 3498 days ago

Don't even show the suspect URL, show "THIS MIGHT BE A SCAM", with some kind of hover over showing the URL, and some way to click to more information.


witty_username 3498 days ago

Why?

Non-latin alphabet domain names do have legitimate uses, although they are very rarely used.


tinus_hn 3498 days ago

Except by a third of all people who live in China and India. Not everyone speaks a language that is representable in the latin alphabet. In fact, a very large percentage of people do not.


saurik 3498 days ago

And it is then worth noting that as it stands, the attitudes of western developers with respect to text input and name lookup have so horribly screwed the Chinese with respect to domain names that they started using numbers instead of letters for their major web properties.

https://newrepublic.com/article/117608/chinese-number-websit...


witty_username 3497 days ago

I live in India. I have never seen a non-Latin alphabet domain, except when I opened <some hindi word>.<tld> and <poop emoji>.com just out of curiosity. Could you show me some non Latin alphabet domain names that are used?

I am not claiming that everyone speaks a language that is representable in the Latin alphabet.


hx87 3498 days ago

China and India don't pose a problem since Pinyin uses standard ASCII characters and neither Chinese characters nor Brahmic scripts have any symbols that resemble ASCII characters.


a3n 3498 days ago

For the same reason that my email client occasionally tells me "this may be a scam," even though sometimes it's not and I act accordingly. Based on whatever criteria it's using, the data received has a somewhat higher chance of being illegitimate.

We as (technical) humans can recognize (hence this discussion) that the use of this uncommon G is meant to mislead you into thinking you're going to Google, when in fact you're going to Hell. I'd like to be warned of that possibility.

In this case, the extremely oversimplified algorithm might be "does the domain, as filtered down to canonical characters, represent one of the top five destination domains, yet go somewhere else if not canonicalized?"
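That oversimplified algorithm could be sketched in Python roughly as follows. The lookalike table and the protected-domain list are hypothetical, hand-picked stand-ins; a real implementation would need far larger data, for example the Unicode confusables data.

```python
# Hypothetical, hand-picked lookalike table (small capital G, Cyrillic o,
# Greek omicron, Cyrillic ie). Not a real confusables list.
LOOKALIKES = {"\u0262": "g", "\u043e": "o", "\u03bf": "o", "\u0435": "e"}

# Placeholder list of well-known domains to protect.
PROTECTED = {"google.com", "paypal.com"}

def skeleton(domain):
    """Replace known lookalikes with the ASCII letter they imitate."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in domain.lower())

def looks_like_spoof(domain):
    """True if the domain imitates a protected domain without being it."""
    return domain.lower() not in PROTECTED and skeleton(domain) in PROTECTED

print(looks_like_spoof("\u0262oogle.com"))
print(looks_like_spoof("google.com"))
```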


pvdebbe 3498 days ago

The Chinese will be thrilled!


hx87 3498 days ago

Chinese people will be fine since all Chinese URLs are either ASCII compliant or use Chinese characters, which can't be confused with any ASCII characters.

Russians would definitely be pissed though.


pvdebbe 3498 days ago

To my understanding the Unicode standard encodes an ASCII transliteration of a Unicode symbol to itself, but what about typographical similarities? Wouldn't that be a hard problem? Perhaps there are two Unicode characters that look exactly the same (using a given typeface) but have different transliterations. Or vice versa: two totally different looking characters share transliterations and give false alarms.


bonzini 3498 days ago

Just handle .рф domains (and the Serbian Cyrillic ccTLD) specially.


deegles 3498 days ago

It's not a great solution since it requires knowing the difference between ASCII and Unicode... I would argue that a user who is vulnerable to falling for unicode characters in domain names won't have that knowledge.


VMG 3498 days ago

"Neat, Sparkasse now even has a colored domain name! Now where did I put my TAN device again?"

- Average User


angry-hacker 3498 days ago

Cool, my head of marketing department wants a RED domain.


koliber 3498 days ago

Chinese character domains would be shown in red letters. I think it's a good choice of color. :)


hatsunearu 3498 days ago

What about websites without Chinese characters? I know in Asia, having red colored names is kind of offensive (evokes of the Reaper's 'hit list').

Would be annoying if [name].me or whatever is red!


booleandilemma 3498 days ago

Black is associated with death in western culture and no one seems to be bothered.


koliber 3498 days ago

I agree with you completely. Cultural sensitivity is a difficult thing when you have a global audience for your product or service.

Maybe as part of the locale configuration, in addition to number and date format, people should pick a friendly and an offensive color! :)


Dove 3498 days ago

No reason for me to try to pick out specific characters. I won't notice. Plus, it won't work for zero width characters, and I might miss it for really tiny ones.

Give me a popup warning explaining the problem when I try to visit the site, same as I get for certificate problems.


nailer 3499 days ago

They should already only show punycode for characters outside your locale.

'ɢ' is obviously an exception since (I imagine) it's considered to be in your locale, but maybe it shouldn't be.


Freak_NL 3498 days ago

That would be very confusing for multilingual users. Just because my OS is configured to use a certain locale, doesn't mean I don't read text in scripts not considered part of it.


nailer 3498 days ago

Your OS (and browser) support multiple languages, so if you speak a language it should be in the list.


Freak_NL 3498 days ago

They are of course, but if you use that list instead of a single locale, you end up with a solution that only highlights 'strange' characters when they are not part of your language/locale set. So for someone who speaks only Latin character based languages you could highlight all Cyrillic characters, but for someone who speaks Russian you still have the original problem (it's not as if you can just highlight all Latin characters in their case!).


marcosdumay 3498 days ago

It should display each code page with a different color. That would make the scheme useful for non-English-speaking people too.


rbanffy 3498 days ago

\o/ rainbow URLs ftw!


vurpo 3498 days ago

It would need a more complex solution than that. For example, this is the website of a local bus company where I live:

http://åbus.fi

The characters are from the Latin character set, but non-ASCII. Highlighting the Å in red would look pretty confusing. And in many countries you want the entire domain name written in non-ASCII characters, depending on the language. E.g. websites in Russia, China, India, etc...


blacktulip 3499 days ago

And on by default


cjrd 3498 days ago

Proud owner of http://gïthub.com checking in...


y4mi 3497 days ago

the visiblend screenshot on your projects page is dead because of an unresolvable dns href.

the screenshot on your kmap repo[1] was dead as well, until i actually opened it. i'm guessing the jpg isnt generated until somebody clicks on it.

enough cyberstalking for me this evening :p

[1] https://github.com/cjrd/kmap


yamaneko 3498 days ago

Awesome site, by the way. I'm just checking out your tutorial on LDA.


TazeTSchnitzel 3498 days ago

https://en.wikipedia.org/wiki/IDN_homograph_attack


talideon 3498 days ago

Most registries did a better job on constructing their IDN tables than Verisign did. :-(


orbitur 3498 days ago

This is something that's been bugging me for years.

Why are there multiple representations of alphabet characters in Unicode? It seems reasonable to include accent marks, but what's the benefit in having a Cyrillic 'o' alongside a standard 'o' or the 2 or 3 other ASCII-lookalike sets of characters?


alisey 3498 days ago

The most important reason is semantics. If "O" and "0" look alike in a certain font, should we use the same character code for both? No, because they have different meaning.

Here are some contexts in which this semantic difference is important: search (compare search results for "cop" and "сор"), alphabetical sorting, text-to-speech, spellchecking, case conversion ("ATOM" -> "atom", but "АТОМ" -> "атом", note the difference between t-т and m-м).
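The case-conversion point is easy to check in Python. The Cyrillic word is written with escapes so it cannot be confused with the Latin one:

```python
latin = "ATOM"
cyrillic = "\u0410\u0422\u041e\u041c"   # АТОМ in Cyrillic letters

# Each string lowercases within its own script.
print(latin.lower())
print(cyrillic.lower())

# The two words look alike in upper case but are different strings.
print(latin == cyrillic, latin.lower() == cyrillic.lower())
```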


jstimpfle 3498 days ago

There will never be agreement on what the set of distinct characters is (also, which characters should be included: bitcoin logo, facebook logo?). I see Unicode as a necessary evil. Due to its complexity most applications should treat Unicode text as black boxes.

I never rely on Unicode for computation. When receiving Unicode I always make sure it's in the ASCII range. It could be argued that there should never have been Unicode domain names but I guess Western people are very lucky that ASCII includes most of their characters...


user5994461 3498 days ago

> When receiving Unicode I always make sure it's in the ASCII range. [...] Western people are very lucky that ASCII includes most of their characters...

Please don't spread the myth of Western languages being encodable in ASCII, and don't pretend to support Unicode (or anything other than English) if you filter everything to ASCII.

The _only_ Western language that is encodable in ASCII is English.

Corollary: English is the only language that can be encoded in ASCII.

The other western languages have endless issues with text being encoded/stripped down to ASCII. e.g. French, Spanish, Portuguese, German...


jstimpfle 3498 days ago

As a German I can attest that I can very well converse (e.g. email) in ASCII. Although it's convenient to use Umlauts, which I do. And I also agree that French or Spanish might be less convenient.

But that was not my point. The point was about identifiers, such as DNS names.


kalleboo 3498 days ago

One goal of Unicode has been lossless round-tripping between legacy encodings (to encourage adoption). If such an encoding contains both Latin and Cyrillic, they must be separate to enable that.


3pt14159 3498 days ago

The (seemingly obstinate) answer is that they are different characters. The Russian H sounds like an N in English.

If you're transcribing a conversation at the UN and there is a mix of different languages the fact that "Het" is transcribed as a latin character set is information. Het may be a southern American group of people, or it could just be a Russian dude saying "no", even if it looks the same.

I understand that we're still burdened by intralanguage homonyms, but I appreciate the fact that it isn't complicated further.


leeoniya 3498 days ago

the font metrics and hinting/kerning are likely language or dialect-specific


kps 3498 days ago

Compatibility with ISO8859. For example, for Cyrillic, the first 128 characters U+04xx match ISO8859-5.
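A quick Python spot check of that ordering, using the built-in `iso8859_5` codec. It samples only four bytes; the constant `OFFSET` is just the difference between a byte value and its code point in the Cyrillic block starting at U+0400, and a few positions (the soft hyphen, for one) map elsewhere.

```python
# Difference between an ISO 8859-5 byte and its Cyrillic-block code point.
OFFSET = 0x0360

# Decode a few bytes from the upper half and compare with byte + OFFSET.
for byte in (0xA1, 0xB0, 0xD0, 0xEF):
    ch = bytes([byte]).decode("iso8859_5")
    same_order = ord(ch) == byte + OFFSET
    print(f"0x{byte:02X} -> U+{ord(ch):04X}, same order: {same_order}")
```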


ergot 3499 days ago

For me it just redirects to

    http://money.get.away.get.a.good.job.with.jack.ilovevitaly.com

The actual domain is http://xn--oogle-wmc.com/

Which is an Internationalized domain name[1] in punycode transcription

[1] https://en.wikipedia.org/wiki/Internationalized_domain_name

The G in question here is

https://en.wiktionary.org/wiki/%C9%A2

http://charcod.es/#%C9%A2/610
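Going the other way, the punycode label can be decoded with Python's built-in `punycode` codec to see which character hides behind it. `from_ace` is a made-up helper name and performs no nameprep checks.

```python
def from_ace(label):
    """Decode an "xn--" label back to Unicode (no nameprep checks)."""
    prefix = "xn--"
    if not label.lower().startswith(prefix):
        return label  # plain ASCII label, nothing to decode
    return label[len(prefix):].encode("ascii").decode("punycode")

decoded = from_ace("xn--oogle-wmc")
# Show the decoded label and the code point of its first character.
print(decoded, f"U+{ord(decoded[0]):04X}")
```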


underyx 3499 days ago

>ilovevitaly.com

This Vitaly guy…

I got tons of referral header spam (that shows up in e.g. Google Analytics) for all sorts of social media buttons and EU cookie law scare tactic sites. And then there was Vitaly who just spammed me with ilovevitaly.com, which if I recall correctly actually was a site about himself at the time.


ergot 3498 days ago

Wow what an odd site


cdubzzz 3498 days ago

Interesting, this domain now redirects to:

    http://money.get.away.get.a.good.job.with.more.pay.and.you.are.okay.money.it.is.a.gas.grab.that.cash.with.both.hands.and.make.a.stash.new.car.caviar.four.star.daydream.think.i.ll.buy.me.a.football.team.money.get.back.i.am.alright.jack.ilovevitaly.com/#.keep.off.my.stack.money.it.is.a.hit.do.not.give.me.that.do.goody.good.bullshit.i.am.in.the.hi.fidelity.first.class.travelling.set.and.i.think.i.need.a.lear.jet.money.it.is.a.secret.%C9%A2oogle.com/#.share.it.fairly.but.dont.take.a.slice.of.my.pie.money.so.they.say.is.the.root.of.all.evil.today.but.if.you.ask.for.a.rise.it%27s.no.surprise.that.they.are.giving.none.and.secret.%C9%A2oogle.com


Kenji 3498 days ago

Unicode URLs are the devil. Too many indistinguishable characters. URLs should stay full ASCII imho. And I say that as someone whose language requires non-ASCII symbols.

Or, in Bruce Schneier's words: "Unicode is just too complex to ever be secure."


rurban 3498 days ago

But think about the poor underrepresented folks using foreign character sets?

You really need to support this 'sub café {} café()' => Undefined subroutine café in your friendly and social programming language, otherwise you will be accused of discrimination. Of course the two é are not normalized.
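The unnormalized é problem, shown with plain Python strings rather than Perl subroutine names (an illustration of the string comparison only, not of how any particular parser treats identifiers):

```python
import unicodedata

precomposed = "caf\u00e9"    # é as one code point
combining = "cafe\u0301"     # e followed by COMBINING ACUTE ACCENT

# The two spellings render the same but compare unequal.
print(precomposed == combining)

# After NFC normalization they are identical.
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", combining))
```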

Which unicode-friendly language does really check for mixed script confusables? Java only is my guess. Even perl6 falls into this trap.

http://unicode.org/reports/tr39/#Mixed_Script_Confusables


palunon 3498 days ago

When it is just accents, it's ok. But when your users have a language that uses a radically different alphabet, sometimes they can't even read Latin transliterations.


rurban 3494 days ago

agree. but then you need to declare your exotic encoding somehow, such as in perl via use encoding 'greek'; and then the parser does not need to guess about mixed-script encodings on every identifier. there's only latin and greek valid, everything else invalid.

how many languages even check for mixed script confusables? for dynamic languages this check is much too expensive, but they are leading the "good cause", allowing everything, and checking nothing.


underyx 3498 days ago

It was a pretty nice surprise that when sending this URL in Slack it was automatically converted to `xn--oogle-wmc.com`.


Fiahil 3498 days ago

Slack is not doing anything. It's Google Chrome filling up your clipboard with the "extended" version of the url.


underyx 3498 days ago

But when I paste it in the Slack message box it shows the ɢoogle.com version.


pvdebbe 3498 days ago

I haven't used Slack, but I think both are following best practice here: Chrome copies the punycoded URL to the clipboard, and Slack decodes pasted punycode URLs into a nicer presentation.


seagreen 3498 days ago

The fact that we need application-specific security measures against this just emphasizes the problem. There are a lot of applications.


SamWhited 3498 days ago

There has been talk at the IETF of redefining IDNA2008 (the current way you prevent issues like this) in terms of the PRECIS framework (RFC 7564). This wouldn't exactly "solve" the problem, but it would mean that IDNA could be more agile with respect to Unicode versions and would make it easier to react to new problems, new confusable characters, etc. as they happen.


vbezhenar 3498 days ago

What about Googlé.com and infinite number of other variations?


StavrosK 3498 days ago

Why is everyone thinking so small? What about https://www.goоgle.com?

Or how about the word "gullible" isn't in the dictionary?

http://www.dictionary.com/browse/gulliblе


tlrobinson 3498 days ago

Not sure why you're getting downvoted, people seem to have missed your clever use of the Cyrillic "o".


amelius 3498 days ago

Why is the Cyrillic "o" even a separate glyph/charcode, even though it looks like a regular "o"?


ozim 3498 days ago

Probably because it is rotated 360°.


mamadrood 3498 days ago

Because it looks like an "o" in Verdana, but it could look different from the "o" in another font.


koliber 3498 days ago

Would it be possible to register a .xn--cm-fmc TLD and have a .cоm registry all of your own?


bmmayer1 3498 days ago

Stupid question, how did you do that? What characters are you using?


freshyill 3498 days ago

I frequently have to deal with lots of scientific, mathematical, and many other unusual characters.

I use http://unicode-table.com to help figure out what's what. The official Unicode specification[1] is impenetrable, and it's really hard to deal with.

[1] http://www.unicode.org/Public/UCD/latest/


vbezhenar 3498 days ago

Second "o" is in fact Cyrillic "о" which looks indistinguishable from Latin "o" (unless you use some weird font without Cyrillic letters).

link

vbezhenar 3498 days ago

I think it's impossible to register this domain.

link

StavrosK 3498 days ago

Considering it's already registered, I'd say it's possible:

https://whois.domaintools.com/xn--gogle-kye.com
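For anyone wondering what the `xn--` labels in this thread stand for: they are Punycode (RFC 3492) with the `xn--` ACE prefix in front. A minimal sketch with Python's built-in `punycode` codec follows; the helper names `decode_label` and `encode_label` are made up for illustration, and this deliberately skips the normalization and validity checks that real IDNA processing performs.

```python
ACE_PREFIX = "xn--"

def decode_label(label: str) -> str:
    """Turn one 'xn--...' DNS label back into its Unicode form (raw Punycode only)."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def encode_label(label: str) -> str:
    """Turn one Unicode label into its 'xn--...' form; ASCII labels pass through."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

for ace in ("xn--gogle-kye", "xn--cm-fmc"):
    unicode_label = decode_label(ace)
    # Show the non-ASCII code points hiding in each label.
    hidden = [f"U+{ord(ch):04X}" for ch in unicode_label if ord(ch) > 127]
    print(ace, "->", unicode_label, hidden)
```

Both labels decode to a string containing U+043E, the Cyrillic "о".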

link

vbezhenar 3498 days ago

Thanks, I was wrong. That's even worse than I thought.

link

StavrosK 3498 days ago

It looks like Google got all lookalikes, because I just tried with a Greek ο and that's also registered by them.

link

joncrocks 3498 days ago

I believe that now that browsers support non-ASCII URLs, each of them has its own anti-phishing scheme.

See https://www.w3.org/International/articles/idn-and-iri/

and https://wiki.mozilla.org/IDN_Display_Algorithm

plus http://www.chromium.org/developers/design-documents/idn-in-g...

link

77pt77 3498 days ago

Browsers have supported this for almost a decade.

link

hannele 3498 days ago

Ahh, the old classic, PayPaI: https://en.wikipedia.org/wiki/PayPaI (uppercase 'i')

link

alessioalex 3498 days ago

This just redirects me to http://xn--oogle-wmc.com/ so I know it's not the real google (using Chrome).

link

cesis 3498 days ago

Why isn't Google Analytics filtering out this referral spam?

link

akerro 3498 days ago

It's literally not their job to filter referrals... they do the opposite, they collect referrals.

link

jahewson 3498 days ago

Browsers already blacklist many visually similar characters; it seems that the IPA characters need to be added to that list.

link

chaz6 3498 days ago

I thought there were supposed to be registry rules preventing similar-looking names from being registered as IDNs. I guess not.

link

shshhdhs 3498 days ago

I believe they aren't preventative measures, but responsive ones. So if Google contacts ICANN, then they may do something about it.

link

darkr 3498 days ago

Some registries do this automatically. Some don't.

link

talideon 3498 days ago

Yes and no. One of the problems is that Verisign's handling of IDNs wasn't exactly the best conceived, which left them with silly IDN codepoint tables like this: https://www.iana.org/domains/idn-tables/tables/com_latn_1.2....

link

Programmatic 3498 days ago

I'm not sure how feasible this is, but wouldn't it make sense for .com/.net/etc to be latin alphabet only and allow other domains to be localized with unicode? I wouldn't really have a problem with 新浪首页.cn, and I doubt I would confuse ɢoogle.ru or whatever with google.com

link

barkingcat 3498 days ago

That defeats the purpose of an internationalized dns system.

The whole point of getting unicode into domain names is so we can have 新浪首页.com so that it's no longer a latin alphabet centric system.

link

Programmatic 3498 days ago

Doesn't that yield a whole class of problems though that we're trying to solve with obtuse solutions such as "let's make that character set in red so people don't get phished"? How is that any more international and/or easy to use?

It seems that putting the allowed character set into the tld would be a pretty user-friendly way of doing that.

Edit: As an added bonus, tlds are centrally managed, and are already western/latin encoded. So why not customize it with a localized abbreviation for the language or tld type?

link

hyperhopper 3498 days ago

One is a matter of international standardization of a protocol. Another is a matter of client side security for a certain type of user.

link

Roboprog 3498 days ago

Cool! I want a cool non-alpha unicode domain. I guess "square-root" is already taken, but there must be some cool domains left (even though nobody can actually type them in).

Actually, some of these would probably be nice aliases for some math / science oriented sites.

E.g. - .com

link

Roboprog 3498 days ago

Meh. Markup ate my "radioactive pie" (9762 dec / 2622 hex) symbol :-(
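The decimal/hex pair quoted above can be checked in a couple of lines. This is just a sketch with Python's standard library; the variable names are arbitrary.

```python
import unicodedata

code_point = 9762            # decimal value quoted above
symbol = chr(code_point)     # the character the markup swallowed

print(hex(code_point))               # same value in hex
print(unicodedata.name(symbol))      # official Unicode name
```

The hex value comes out as `0x2622`, and the character's Unicode name is RADIOACTIVE SIGN.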

link

hannele 3498 days ago

I'm curious, why is it allowed to register domain names with mixed character sets? I am behind allowing Unicode characters in domain names for the obvious reasons, but are there compelling use cases for allowing them to be mixed?

link

klodolph 3498 days ago

Technically, Unicode is only one character set. If you want to disallow mixing, you have to disallow it on some other basis, like script. There are many edge cases to consider, though, and many legitimate reasons to mix scripts.
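To make the "disallow by script" idea concrete, here is a rough sketch. Python's standard library does not expose the Unicode Script property, so this uses the first word of each character's Unicode name as a stand-in; that heuristic and the `rough_scripts` helper are assumptions for illustration, not how browsers or registries actually do it.

```python
import unicodedata

def rough_scripts(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared by all scripts
        name = unicodedata.name(ch, "UNKNOWN")
        scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC", "GREEK"
    return scripts

for label in ("google", "go\u043egle", "\u0262oogle"):
    found = rough_scripts(label)
    verdict = "mixed" if len(found) > 1 else "single script"
    print(label, sorted(found), verdict)
```

Note the edge case: the Cyrillic lookalike is flagged as mixed, but ɢ (U+0262) is itself a Latin letter, so a pure script check lets ɢoogle through.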

link

reacweb 3498 days ago

Maybe browsers should have a security option to whitelist characters in URLs. When a URL uses any other character, there would be popups with explanations and choices.
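A whitelist check of this kind could look roughly like the sketch below. The allowed set (ASCII letters, digits, hyphen, dot) and the `unexpected_chars` helper are assumptions chosen for illustration; a real browser option would presumably let the user pick the set.

```python
import string

# Assumption: the user only expects plain ASCII hostnames.
ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

def unexpected_chars(host: str):
    """List every character in host that falls outside the allowed set."""
    return [(ch, f"U+{ord(ch):04X}") for ch in host.lower() if ch not in ALLOWED]

for host in ("google.com", "go\u043egle.com"):
    flagged = unexpected_chars(host)
    if flagged:
        print(f"warning: {host!r} contains {flagged}")
    else:
        print(f"ok: {host!r}")
```

Anything flagged here is where the popup with explanations and choices would go.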

link

transfire 3498 days ago

Oh, you mean Unicode Sucks(TM)? Yes. Yes it does.

link