(Incidentally, this may explain the finding from http://stackoverflow.com/a/16622773/172322, as to why adding the RegexOptions.ECMAScript flag in the C# code eliminates the performance gap)
Also, when using Python 3.2 it seems to be the default behavior
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'\d', '੧')
<_sre.SRE_Match object at 0x7f188f6d4850>
\d A digit: [0-9]
\p{Digit} A decimal digit: [0-9]
which is actually somewhat depressing. I'd expect the named class to include the full Unicode digit set. It's surprising to see:
ab1234567890cd matched 1234567890
ab𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟢cd no match
from code using
Pattern.compile("(\\p{Digit}+)");
EDIT: and perhaps more surprising to see in the logs:
Exception in thread "main" java.lang.NumberFormatException: For input string: "𝟤𝟥𝟦𝟧"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:449)
I would be reluctant to rely on this until the Go documentation is clearer on the intended behavior. Right now it's very poorly specified. The regex doc[1] talks about "same general syntax" as Perl, but points to [2], which doesn't seem to understand what it's saying, describing '\d' in terms of its "Perl" meaning, but then saying that it's [0-9].
As a Perl developer that's been making the switch to Go, I've been caught out a few times with Go's no-so-Perl-like regular expression syntax. In fact I wish I knew about your 2nd link before now, because that could have saved me a few hours over recent months.
I'm not particularly fond of go, but "correctly handling unicode" can be subjective and case-dependent... I think making only minimal guarantees and punting to the application is often the only sane course.
> "correctly handling unicode" can be subjective and case-dependent...
So is correctly handling integers.
> I think making only minimal guarantees and punting to the application is often the only sane course.
That is completely and utterly crazy, the average developer has neither the knowledge nor the resources to make anything but a mess out of it without proper tools and APIs. Even with these (including a complete implementation of the unicode standard and its technical reports) unicode is already complex enough to deal with.
Happens in PHP only if you enable Unicode regex handling via the /u modifier and are running libpcre 8.10 or later (which corresponds to PHP 5.3.4 and later, assuming you're using the bundled libpcre): http://3v4l.org/QD3k0
If you're using pcre directly from C code, this is controlled by specifying the PCRE_UCP flag to pcre_compile(). By default, \d and friends only match ASCII characters even if the PCRE_UTF8 flag is set.
I'd argue that perl gets it right--as the default behavior, this behavior would gravely violate the principle of least surprise, but for the 0.01% of people who want \d to match ੧, there's no harm to making it available as an option you need to specifically request.