

by deathanatos 4394 days ago

A regex seems to take only ~1µs.

  In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.?\\d+))')

  In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
  1000000 loops, best of 3: 1.05 µs per loop

But hey, once it's written in C, why go back?

I'm missing the timezone, but the OP left that out, so I did too. For comparison, dateutil's parse takes ~76µs for me. Kinda makes me wonder why aniso8601 is so slow. (It's also missing a few other things, depending on whether you count all the non-time forms as valid input.)
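To make the gaps concrete, here is a small sketch of what the regex above accepts and rejects. It is the same pattern, only written as a raw string. Note that the trailing `(?:\.?\d+)` is not optional, so a timestamp without fractional seconds does not match at all.

```python
import re

# Same pattern as in the comment above, written as a raw string.
iso_regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}(?:\.?\d+))')

# Full timestamp with fractional seconds: matches.
with_fraction = iso_regex.match('2014-01-09T21:48:00.921000')

# No fractional seconds: the non-optional (?:\.?\d+) makes this fail.
without_fraction = iso_regex.match('2014-01-09T21:48:00')

# Date-only form: also fails, since every time field is required.
date_only = iso_regex.match('2014-01-09')

print(with_fraction.groups())
print(without_fraction, date_only)
```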

That said, cool! I might use this. One of the things that makes dateutil's parse slower is that it'll parse more than just ISO-8601: it parses many things that look like dates, including some very non-intuitive ones that have caused "bugs"¹. Usually in APIs, it's "dates are always ISO-8601", and all I really need is an ISO-8601 parser. While I appreciate the theory behind "be liberal in what you accept", sometimes I'd rather error out than build expectations that sending garbage (er, stuff that requires a complicated parse algorithm that I don't really understand) is okay.

¹dateutil.parser.parse('') is midnight of the current date. Why, I don't know. Also, dateutil.parser.parse('noon') is "TypeError: 'NoneType' object is not iterable".

3 comments

ajanuary 4394 days ago

The library has the following features your regex is missing:

* Every part from month onwards is optional

* Separator characters are optional

* Date/time separator can be a space as well as T

* Timezone information

* Parsing the strings into numbers

* Actually creates a datetime object

I expect adding all of those will bump up the time a bit.


ajanuary 4394 days ago

I'm not much of a regex wizard, but I tried to add all the features listed other than parsing the result and creating the datetime object.

    iso_regex = re.compile('([0-9]{4})-?([0-9]{1,2})(?:-?([0-9]{1,2})(?:[T ]([0-9]{1,2})(?::?([0-9]{1,2})(?::?([0-9]{1,2}(?:\\.?[0-9]+)?))?(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2})))?)?)?')

It seems like it performs quite a bit worse than the library, which creates the full object.

    In [82]: %timeit ciso8601.parse_datetime('2014-01-09T21:48:00.921000')
    1000000 loops, best of 3: 368 ns per loop

    In [83]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
    100000 loops, best of 3: 9.72 µs per loop

In the interest of intellectual pursuit, is there anything that can be done to the regex to speed it up?
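One thing worth checking before tuning for speed: as posted, the zone group `(?:(Z)|...)` is not optional, and it sits inside the optional minute group. For a string with no zone, the engine tries every minute/second/fraction combination, fails to find a zone each time, and finally settles for a match that stops after the hour. That backtracking is a plausible contributor to the slow timing, though this sketch does not measure it. The names `posted` and `relaxed` are only illustrative, and `relaxed` differs from the posted pattern by a single `?` after the zone group.

```python
import re

# The regex as posted, written as a raw string.
posted = re.compile(
    r'([0-9]{4})-?([0-9]{1,2})(?:-?([0-9]{1,2})(?:[T ]([0-9]{1,2})'
    r'(?::?([0-9]{1,2})(?::?([0-9]{1,2}(?:\.?[0-9]+)?))?'
    r'(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2})))?)?)?'
)

# Same pattern with the zone group made optional (one added '?').
relaxed = re.compile(
    r'([0-9]{4})-?([0-9]{1,2})(?:-?([0-9]{1,2})(?:[T ]([0-9]{1,2})'
    r'(?::?([0-9]{1,2})(?::?([0-9]{1,2}(?:\.?[0-9]+)?))?'
    r'(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2}))?)?)?)?'
)

stamp = '2014-01-09T21:48:00.921000'

# With the mandatory zone group, the match silently stops after the hour.
print(posted.match(stamp).group(0))

# With the optional zone group, the whole string is consumed.
print(relaxed.match(stamp).group(0))
```

Running `%timeit` on both patterns with the same input would show whether the extra backtracking accounts for the difference on a given machine.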


randlet 4394 days ago

Note that you still need to convert your regex match to a datetime object, which is likely to add significant overhead.
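For a sense of what that conversion involves, here is a minimal sketch. `ISO_RE` and `parse_iso` are hypothetical names, and the pattern is a simplified strict form (full date and time, optional fraction of up to six digits, no zone) rather than either regex posted in this thread.

```python
import re
from datetime import datetime

# Hypothetical strict pattern: full date and time, optional fractional seconds.
ISO_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,6}))?')

def parse_iso(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS[.ffffff]' into a naive datetime."""
    m = ISO_RE.fullmatch(text)
    if m is None:
        raise ValueError('not a supported ISO-8601 timestamp: %r' % (text,))
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    fraction = m.group(7)
    # Right-pad the fraction to six digits so '.5' becomes 500000 microseconds.
    microsecond = int(fraction.ljust(6, '0')) if fraction else 0
    return datetime(year, month, day, hour, minute, second, microsecond)

print(parse_iso('2014-01-09T21:48:00.921000'))
```

The `int()` calls and the `datetime` constructor are the extra work that a bare `match()` timing leaves out.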


thomas-st 4394 days ago

Good idea with the regex; I haven't tried it. That said, you didn't account for the time it takes to construct a datetime object, let alone attach time zone information. ciso8601 supports time zones using pytz's FixedOffset and UTC classes; see https://github.com/elasticsales/ciso8601 for additional benchmarks. There's potential for further speedup by using a tzoffset subclass written in C, but in our case all dates were UTC anyway, so we didn't need the time zone.
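As an illustration of the "attach time zone information" step, here is a rough sketch that uses the standard library's `datetime.timezone` in place of pytz. It is not how ciso8601 does it; `attach_offset` is a hypothetical helper, and the accepted zone spellings ('Z', '+HH', '+HHMM', '+HH:MM') are an assumption.

```python
from datetime import datetime, timedelta, timezone

def attach_offset(naive, zone):
    """Attach a fixed UTC offset to a naive datetime.

    `zone` is 'Z' or a signed offset such as '+05:30', '-0800' or '+01'.
    """
    if zone == 'Z':
        return naive.replace(tzinfo=timezone.utc)
    if zone[:1] not in ('+', '-'):
        raise ValueError('unsupported zone designator: %r' % (zone,))
    sign = -1 if zone[0] == '-' else 1
    digits = zone[1:].replace(':', '')
    hours = int(digits[:2])
    minutes = int(digits[2:4] or 0)  # minutes are optional, e.g. '+01'
    offset = sign * timedelta(hours=hours, minutes=minutes)
    return naive.replace(tzinfo=timezone(offset))

aware = attach_offset(datetime(2014, 1, 9, 21, 48), '-08:00')
print(aware.isoformat())
```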
