Hacker News
by kijin 4790 days ago
> Nobody has ever sent any of my users a legitimate email with a utf8-encoded subject.

I suppose you and your users only ever communicate with Westerners. Or are all of your Japanese correspondents (for example) kind enough to encode their subjects in EUC-JP instead of UTF-8?

1 comments

You suppose correctly; obviously that might change, but for now it's the case that in 20 years of email not one single legitimate message has had a subject line outside 7-bit ASCII, whereas there's always plenty of ⋎Іǎḡɾǎ, ѵἲàɠṝà and ѷἰẫǧʀẫ to go around (in the form =?utf-8?Q?=E2=8B=8E=D0=86=C7=8E=E1=B8=A1=C9=BE=C7=8E?= of course).

Pro tip: if anyone is trying to block this shit without blocking legitimate Unicode, you'll be wanting Unicode::Normalize and something like

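    use Unicode::Normalize;                  # exports NFKD
    utf8::decode($rawSubject);               # UTF-8 bytes -> characters (MIME encoded-words must already be decoded)
    my $normalised = NFKD($rawSubject);      # split each letter from its combining marks
    $normalised =~ s/\p{NonspacingMark}//g;  # drop the combining marks

to strip the combining diacritics before you reach for the regexes. Good luck.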
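
For anyone who wants to try that end to end, here's a rough sketch that starts from the raw header rather than from already-decoded UTF-8 bytes. It assumes the subject arrives as an RFC 2047 encoded-word like the one quoted above and uses Encode's MIME-Header decoder for that step; fold_subject is a made-up helper name, not something from the original snippet.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper (name is illustrative): reduce a raw Subject header
# to a string with combining marks removed, ready for pattern matching.
sub fold_subject {
    my ($raw) = @_;

    # Decode RFC 2047 encoded-words (=?utf-8?Q?...?=) into Perl characters.
    my $text = decode('MIME-Header', $raw);

    # Compatibility-decompose, so accented letters become base letter + marks.
    my $folded = NFKD($text);

    # Remove the nonspacing (combining) marks that NFKD split out.
    $folded =~ s/\p{Mn}//g;

    return $folded;
}

my $raw    = '=?utf-8?Q?=E2=8B=8E=D0=86=C7=8E=E1=B8=A1=C9=BE=C7=8E?=';
my $folded = fold_subject($raw);

# Print code points so the result is readable regardless of terminal fonts.
print join(' ', map { sprintf 'U+%04X', ord } split //, $folded), "\n";
```

Note that this only takes care of the diacritics: lookalike base characters such as the Cyrillic І or ɾ have no decomposition and come through untouched, so the regexes still have to cope with those.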