| HN Mirror

	by Aurel1us 4779 days ago
	Short answer: \d includes all the Unicode characters from http://www.fileformat.info/info/unicode/category/Nd/list.htm
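For a rough sense of how large that category is, here is a minimal Python 3 sketch that enumerates the code points whose general category is Nd. The helper name `nd_digits` is made up for illustration. The exact count depends on the Unicode version bundled with the interpreter, so no number is quoted here.

```python
import sys
import unicodedata


def nd_digits():
    """Return every character whose Unicode general category is Nd."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Nd"
    ]


digits = nd_digits()
# The ten ASCII digits are only a small subset of the category.
ascii_digits = [ch for ch in digits if ord(ch) < 128]

print(len(digits), "decimal digits in this interpreter's Unicode tables")
print("ASCII subset:", "".join(ascii_digits))
```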

3 comments

ars 4779 days ago

Is that actually a good thing? If I'm using \d to validate numbers (for example to check before string to int conversion, or IP address, phone number, or any other use), other unicode digits are not helpful to me.

It's great to support unicode, but I don't think the \d should have been extended this way. Add a \ud or something.

Tuna-Fish 4779 days ago

Given that the category is specifically "decimal digit", I think it's good, so long as the number parsing code accepts them all too.
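As one data point on that condition: Python 3's built-in `int()` does accept decimal digits from other scripts, so there a `\d` match and the subsequent conversion agree. A minimal sketch follows. The sample string is the Gurmukhi digits one, two, three, written as escapes, and this says nothing about parsers in other languages.

```python
import re
import unicodedata

# GURMUKHI DIGIT ONE, TWO, THREE (U+0A67..U+0A69)
text = "\u0a67\u0a68\u0a69"

# Every character is in the Nd ("decimal digit") category.
all_nd = all(unicodedata.category(ch) == "Nd" for ch in text)

# In Python 3, \d in a str pattern matches Unicode decimal digits by default.
matched = re.fullmatch(r"\d+", text) is not None

# int() accepts the same digits, so validation and parsing agree.
value = int(text)

print(all_nd, matched, value)  # True True 123
```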

dllthomas 4779 days ago

Yes. Assuming that, it's good. I think that assumption is likely to be invalid in many cases, though.

rmc 4779 days ago

Yes, it's a good thing. There are other places in the world that don't just use ASCII. If you want European-style numbers, just use [0-9].
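To make the [0-9] suggestion concrete, here is a rough Python 3 sketch of a shape check for dotted-quad addresses. The names `ASCII_IPV4`, `UNICODE_IPV4`, and `looks_like_ipv4` are invented for this example, and the check deliberately ignores octet ranges.

```python
import re

# Four dot-separated groups of one to three ASCII digits.
ASCII_IPV4 = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")
# Same shape, but \d also accepts non-ASCII decimal digits in Python 3.
UNICODE_IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def looks_like_ipv4(text, pattern=ASCII_IPV4):
    """Shape check only: does not verify that each octet is <= 255."""
    return pattern.fullmatch(text) is not None


plain = "192.168.0.1"
mixed = "192.168.0.\u0a67"  # last octet is GURMUKHI DIGIT ONE

print(looks_like_ipv4(plain))                # True
print(looks_like_ipv4(mixed))                # False
print(looks_like_ipv4(mixed, UNICODE_IPV4))  # True
```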

bellbind 4779 days ago

If you use a preg engine you can add the /a modifier which excludes unicode chars from matches.

chebucto 4779 days ago

Maybe specify the subset of unicode you're expecting in the headers, and have the compiler do the nitty gritty?

wging 4779 days ago

...at least in C# regexes.

ars 4779 days ago

Anyone know if this happens in other languages?

yahelc 4779 days ago

Doesn't appear to in JavaScript:

    "੧".match(/\d/); //null

(Incidentally, this may explain the finding from http://stackoverflow.com/a/16622773/172322, as to why adding the RegexOptions.ECMAScript flag in the C# code eliminates the performance gap)

deskglass 4779 days ago

Nor in python:

  print re.match(r'\d', '੧')
  None

wulczer 4779 days ago

It does when using the re.U flag:

  re.match(r'\d', u'੧', re.U)
  <_sre.SRE_Match at 0x3070ac0>

  sys.version
  2.7.3 (default, Mar  4 2013, 14:57:34) \n[GCC 4.7.2]

tcas 4779 days ago

Also, in Python 3.2 it seems to be the default behavior:

  Python 3.2.3 (default, Oct 19 2012, 20:10:41) 
  [GCC 4.6.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> re.match(r'\d', '੧')
  <_sre.SRE_Match object at 0x7f188f6d4850>
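For anyone who wants the old ASCII-only meaning back in Python 3, the `re.ASCII` flag (or its inline form `(?a)`) restricts `\d` to [0-9] again. A small sketch, with the test character written as an escape:

```python
import re

gurmukhi_one = "\u0a67"  # GURMUKHI DIGIT ONE

# str patterns are Unicode-aware by default in Python 3.
default_match = re.match(r"\d", gurmukhi_one)
# re.ASCII makes \d, \w and \s match ASCII characters only.
ascii_match = re.match(r"\d", gurmukhi_one, re.ASCII)
# The same flag can be set inline at the start of the pattern.
inline_match = re.match(r"(?a)\d", gurmukhi_one)

print(default_match is not None)  # True
print(ascii_match is None)        # True
print(inline_match is None)       # True
```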

Falling3 4779 days ago

Yes, but not by default.

jrabone 4778 days ago

Not true for Java. Docs even say:

  \d         A digit: [0-9]
  \p{Digit}  A decimal digit: [0-9]

which is actually somewhat depressing. I'd expect the named class to include the full Unicode digit set. It's surprising to see:

  ab1234567890cd matched 1234567890
  ab𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟢cd no match

from code using Pattern.compile("(\\p{Digit}+)");

EDIT: and perhaps more surprising to see in the logs:

  Exception in thread "main" java.lang.NumberFormatException: For input string: "𝟤𝟥𝟦𝟧"
  	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
  	at java.lang.Integer.parseInt(Integer.java:449)

That'll keep someone guessing for a while...
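One defensive option, sketched here in Python 3 rather than Java, is to map every Unicode decimal digit to its ASCII counterpart before handing the string to a parser that only understands [0-9]. The helper `to_ascii_digits` is hypothetical, and nothing here is a claim about how Java's own parsing works.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace each Unicode decimal digit with its ASCII equivalent.

    Hypothetical helper; non-digit characters pass through unchanged.
    """
    out = []
    for ch in text:
        # decimal() returns the digit's value, or the default (None) otherwise.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


# MATHEMATICAL SANS-SERIF DIGIT TWO..FIVE (U+1D7E4..U+1D7E7)
fancy = "\U0001d7e4\U0001d7e5\U0001d7e6\U0001d7e7"

print(to_ascii_digits(fancy))                # 2345
print(to_ascii_digits("ab" + fancy + "cd"))  # ab2345cd
```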

xudongz 4779 days ago

Not true for Go

http://play.golang.org/p/ls96RxJxpz

nknighthb 4779 days ago

I would be reluctant to rely on this until the Go documentation is clearer on the intended behavior. Right now it's very poorly specified. The regex doc[1] talks about "same general syntax" as Perl, but points to [2], which doesn't seem to understand what it's saying, describing '\d' in terms of its "Perl" meaning, but then saying that it's [0-9].

[1] http://golang.org/pkg/regexp/

[2] https://code.google.com/p/re2/wiki/Syntax

laumars 4779 days ago

As a Perl developer who's been making the switch to Go, I've been caught out a few times by Go's not-so-Perl-like regular expression syntax. In fact, I wish I had known about your 2nd link before now, because it could have saved me a few hours over recent months.

jamesmiller5 4779 days ago

Considering Go's vocal support for UTF-8, I'm surprised at this behavior and curious as to the reason for excluding it.

masklinn 4779 days ago

Supporting UTF8 and correctly handling unicode are very, very different beasts. The former is absolutely trivial, the latter is extremely difficult.

Go is vocal about the former, but seems to not give a shit about the latter.

snogglethorpe 4779 days ago

I'm not particularly fond of go, but "correctly handling unicode" can be subjective and case-dependent... I think making only minimal guarantees and punting to the application is often the only sane course.

nspragmatic 4778 days ago

It happens in Objective-C:

    NSString *pattern = @"\\d", *string = @"੧";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                           options:NSRegularExpressionCaseInsensitive
                                                                             error:nil];

    NSUInteger numMatches = [regex numberOfMatchesInString:string
                                                   options:0
                                                     range:NSMakeRange(0, [string length])];

    numMatches ? NSLog(@"%@ found by %@", string, pattern) : NSLog(@"%@ not found", string);

    // 2013-05-20 09:38:42.650 Regexperiment[17848:c07] ੧ found by \d

LawnGnome 4779 days ago

Happens in PHP only if you enable Unicode regex handling via the /u modifier and are running libpcre 8.10 or later (which corresponds to PHP 5.3.4 and later, assuming you're using the bundled libpcre): http://3v4l.org/QD3k0

bodyfour 4779 days ago

If you're using pcre directly from C code, this is controlled by specifying the PCRE_UCP flag to pcre_compile(). By default, \d and friends only match ASCII characters even if the PCRE_UTF8 flag is set.

Falling3 4779 days ago

Exactly what I was thinking.

Doesn't in Ruby:

  /\d/.match "੧" #=> nil

Argorak 4778 days ago

Just for reference:

  /\p{Digit}/.match "੧" #=> #<MatchData "੧">

cwmma 4779 days ago

All the same speed in JavaScript: http://jsperf.com/regexcwm/2

jeltz 4779 days ago

Happens in Perl, but not in Ruby or PostgreSQL.

pfedor 4778 days ago

Doesn't happen in Perl for me:

  pfedor@Pawels-iMac:~$ perl -ne 'print "Digit!\n" if /\d/'
  af
  3
  Digit!
  23fa3
  Digit!
  asdf
  ١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹
  ৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫
  ୧୨୩୪୫୬୭୮
  ౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓
  234
  Digit!

(perl from Macports and perl from /usr/bin/perl behave the same in this respect.)

xonea 4778 days ago

You have to tell Perl to interpret stdin as UTF-8 (flag -C); then it works: https://news.ycombinator.com/item?id=5734641

pfedor 4778 days ago

Good to know, thanks.

I'd argue that Perl gets it right: as the default, this behavior would gravely violate the principle of least surprise, but for the 0.01% of people who want \d to match ੧, there's no harm in making it available as an option you have to request specifically.

hkmurakami 4779 days ago

Oh wow, I had no idea that "full width digits" (U+FF10 to U+FF19) can actually be handled properly.
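A quick Python 3 sketch of what handling them can look like: `\d` matches the fullwidth forms, and NFKC normalization folds them to plain ASCII digits. NFKC only folds compatibility variants like these; digits from other scripts, such as Gurmukhi, are left alone.

```python
import re
import unicodedata

# FULLWIDTH DIGIT ONE, TWO, THREE (U+FF11..U+FF13)
fullwidth = "\uff11\uff12\uff13"

# \d is Unicode-aware by default, so the fullwidth forms match.
matches_d = re.fullmatch(r"\d+", fullwidth) is not None

# NFKC replaces compatibility characters with their plain equivalents.
folded = unicodedata.normalize("NFKC", fullwidth)

# Digits of another script have no compatibility mapping and are unchanged.
gurmukhi_one = "\u0a67"
unchanged = unicodedata.normalize("NFKC", gurmukhi_one) == gurmukhi_one

print(matches_d)        # True
print(folded == "123")  # True
print(unchanged)        # True
```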

coldtea 4779 days ago

Or improperly. If you expect \d to be a shorthand for 0-9, your string can also contain junk.
