by jrabone 4789 days ago
You suppose correctly; obviously that might change, but for now it's the case that in 20 years of email not one single legitimate message has had a non-7-bit ASCII subject line, whereas there's always plenty of ⋎Іǎḡɾǎ, ѵἲàɠṝà and ѷἰẫǧʀẫ to go around (in the form =?utf-8?Q?=E2=8B=8E=D0=86=C7=8E=E1=B8=A1=C9=BE=C7=8E?= of course).

Pro tip: if anyone is trying to block this shit without blocking legitimate Unicode, you'll be wanting Unicode::Normalize and something like

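    use Unicode::Normalize;                  # provides NFKD
    utf8::decode($rawSubject);               # UTF-8 bytes -> characters (assumes the MIME encoded-word is already unwrapped)
    my $normalised = NFKD($rawSubject);      # compatibility decomposition: base letter + combining marks
    $normalised =~ s/\p{NonspacingMark}//g;  # drop the combining marks, keep the base letters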
to strip the combining diacritics before you reach for the regexes. Good luck.
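
For anyone starting from the raw header rather than already-decoded bytes, here's a rough end-to-end sketch of the same idea. The `fold_subject` name is made up for illustration, and it assumes Encode's `MIME-Header` decoder is good enough for your mail; treat it as a starting point, not a finished filter.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper (not from any library): take a raw Subject header
# and return it with combining diacritics stripped.
sub fold_subject {
    my ($raw) = @_;

    # Unwrap RFC 2047 encoded-words (=?utf-8?Q?...?=) into Perl characters.
    my $chars = decode('MIME-Header', $raw);

    # Decompose accented letters into base letter + combining marks...
    my $folded = NFKD($chars);

    # ...then throw the combining marks away.
    $folded =~ s/\p{NonspacingMark}//g;

    return $folded;
}

my $folded = fold_subject('=?utf-8?Q?=C7=8E=E1=B8=A1?=');
```

Note this only removes diacritics. Cross-script lookalikes such as the Cyrillic І or the ⋎ in the example above have no decomposition, so they come through NFKD untouched and your regexes (or a separate confusables table) still have to deal with them.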