| HN Mirror


by gukoff 1962 days ago

With PyO3, I built a library that parses datetimes 10x faster than `datetime.strptime` in just a few lines of code: https://github.com/gukoff/dtparse

It just calls Rust's chrono library, which does the parsing, and wraps the result in a Python object. You can do this for any Rust library; it's very, very easy!

The only slightly complicated part is the distribution. You need to use https://github.com/PyO3/maturin or https://github.com/PyO3/setuptools-rust, and of course, you need to have Rust installed on the wheel-building machine.

Feel free to use this repo as a reference if you want to build something similar. The code is commented, and there's a working GitHub Action that builds the wheels for all platforms and uploads them to PyPI: https://github.com/gukoff/dtparse/tree/master/.github/workfl...

6 comments

japhyr 1962 days ago

I was surprised to find out how slow strptime() can be. I was working on a data-focused project that was finally starting to slow down from the growing volume of data. I was looking at river heights over time, and once I hit about 140,000 data points the project got slow enough to make some profiling and optimization worthwhile. I was quite surprised to find it was spending more than two full seconds just running strptime(), out of a total execution time of around 15 seconds.

I ended up looking at a bunch of different ways of processing timestamps in Python: strptime(), string parsing, regex, datetime.fromisoformat(), NumPy, Pandas, and more. I got a 46x speedup using datetime.fromisoformat(). Other approaches got anywhere from a 4x to a 40x speedup, and a couple of approaches were an order of magnitude slower than strptime().
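For anyone who wants to try a comparison like this themselves, here is a minimal sketch using only the standard library. It assumes the parsing call meant above is `datetime.fromisoformat()` (the parsing counterpart of `isoformat()`); the sample timestamp and the function names are made up for illustration, and the timings will differ by machine and Python version.

```python
from datetime import datetime
import timeit

# Hypothetical sample value; real data will have its own format.
STAMP = "2020-11-23T14:05:09"


def parse_strptime(s):
    # General-purpose parser driven by a format string.
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")


def parse_fromisoformat(s):
    # Only accepts ISO 8601-style strings.
    return datetime.fromisoformat(s)


def parse_slicing(s):
    # Manual string parsing; assumes a fixed-width layout.
    return datetime(
        int(s[0:4]), int(s[5:7]), int(s[8:10]),
        int(s[11:13]), int(s[14:16]), int(s[17:19]),
    )


if __name__ == "__main__":
    for fn in (parse_strptime, parse_fromisoformat, parse_slicing):
        seconds = timeit.timeit(lambda: fn(STAMP), number=100_000)
        print(f"{fn.__name__}: {seconds:.3f}s per 100,000 calls")
```

All three functions return the same `datetime` for this input, so the only difference being measured is parsing cost.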

My takeaway was there's no substitute for profiling the actual code you're running, and focusing on the specific bottlenecks in your own project. I wrote this up in a blog post if anyone's interested, "What's faster than strptime()?"

https://ehmatthes.com/blog/faster_than_strptime/

Rotareti 1962 days ago

This is awesome, thanks for sharing! I think this should be added to the PyO3 examples list :)

https://github.com/PyO3/pyo3#examples

mrcarruthers 1962 days ago

how does it compare against ciso8601 perf-wise? https://pypi.org/project/ciso8601/

to be fair ciso8601 only parses iso8601 datetimes, but that's enough for 90%+ of my use cases.

gukoff 1961 days ago

ciso8601 is blazingly fast, and also its wall time is very stable. By all means, use ciso8601 if the format allows :)

On my machine, ciso8601 always runs in 240ns, and the Rust lib median time is 1250ns.

You can run a benchmark too! Just call pytest, and it will generate an .svg report: https://github.com/gukoff/dtparse/blob/master/tests/test_per... (you'll need to pip install ciso8601 pytest pytest-benchmark[histogram])

throwaway894345 1962 days ago

I'm very curious to hear the use case for which date time parsing was the bottleneck! Also, I'm surprised that the overhead of calling across the language boundary didn't dwarf the gains from parsing...

gukoff 1961 days ago

One of the components in our project was churning through thousands of JSONs per second - deserializing, transforming and serializing them.

These JSONs represented the flight information. They included multiple datetimes, such as the scheduled departure/arrival time and the real departure/arrival time of a flight.

The first bottleneck was JSON deserialization/serialization. At that time we solved it with ujson, and now there's the even more performant orjson.

The second bottleneck happened to be datetime deserialization. We solved it with ciso8601, since luckily these datetimes were in ISO 8601. But this bottleneck later occurred repeatedly in other components and became the inspiration to write dtparse :)

sillysaurusx 1961 days ago

Wow, orjson is amazing. It even serializes numpy arrays. Thanks!

delduca 1961 days ago

`pysimdjson` is even better!

oblvious-earth 1962 days ago

I've had this situation a few times. Most recently when transforming large (1-50 GB) CSV files into a format that can be digested by a proprietary bulk DB loader.

Because our problem was just about reformatting, we ended up reading the CSVs in binary mode and using struct to extract the relevant values from the datetime fields. But if we needed to do actual date logic, something like this would perhaps be useful (though there are other fast datetime libraries out there; I've been a fan of pendulum for some tasks).

throwaway894345 1962 days ago

That makes sense, but I have a hard time believing the approach of calling into a date time parser O(n) times is going to yield a significant performance gain no matter how much faster the parser is. However, I'm being downvoted, so perhaps I'm mistaken?

oblvious-earth 1962 days ago

Sometimes it's about optimizing wall time not algorithmic complexity.

If you have a batch SLA of 1 hour, you're currently spending 50-70 minutes to complete the batch, 20 minutes of that time is spent on date parsing, and you can reduce that to 5 minutes, that's a big win.

throwaway894345 1962 days ago

No doubt, but if your date parsing saves you 1 second per date parsed but each call into the faster library costs 2 seconds, then your performance actually suffers. The only way around this is to make a batch call such that the overhead is O(1).

minitech 1962 days ago

I’m not going to install it to check, but when someone writes “Fast datetime parser for Python written in Rust. Parses 10x-15x faster than datetime.strptime.” it seems reasonable to assume that this is not the case.

ahupp 1961 days ago

In a language like Java where you mostly spend time in the VM and only occasionally jump into native code, that might be true. But in python a huge part of the runtime is this kind of native call. So I would not expect that this approach adds any new overhead.

lincolnq 1962 days ago

My instinct is that the overhead is small. You need to add a few C stack frames and do some string conversion on each call, maybe an allocation to store the result. It’s not going to be as quick as doing it in pure Rust, but the Python-to-native code layer can be pretty lightweight I think!

brundolf 1962 days ago

Maybe they did it in bulk? i.e. send all the strings over at once, parse them in a loop, send them back. Seems like that would reduce overhead

throwaway894345 1962 days ago

Right, and that makes sense, but the context here is a date parsing library for Python--unless said library has a batch interface, I'm not sure how that would improve performance, but maybe I'm misestimating something.

brundolf 1962 days ago

Ah, I skimmed over the part where this is a library and not application-code

pbecotte 1962 days ago

I've certainly never been bottlenecked on date parsing :) However, many/most of the high-performance Python libraries are built in C and compiled down into something the Python interpreter can use directly. There are lots of Python bindings written in C++ to native C libraries as well; I know I have used ZeroMQ pretty recently. Rust is done the same way: the code is compiled down into objects that Python can use directly. It's not like running a JavaScript interpreter in your code.

cdavid 1961 days ago

I have seen it in many cases, especially working on financial data. My most recent example was working with real time feeds of trades, which we used ML models on top of. Inference was based on accumulated volume per fixed amount of time (say 30 sec, 1 min), and the code doing this in real time was python.

I don't remember the numbers, but caching + using ciso8601 was essential to manage the peak load (maybe 50k trades per sec ?).
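A minimal sketch of the caching idea, not the actual code from that project: it uses the standard library's `datetime.fromisoformat()` as a stand-in for ciso8601, and the function name and cache size are assumptions. It helps when many records share the same timestamp string, as in a burst of trades within one second.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=4096)  # cache size is an arbitrary choice
def parse_cached(stamp: str) -> datetime:
    # Stand-in parser; swap in a faster one if it is available.
    return datetime.fromisoformat(stamp)
```

Since `datetime` objects are immutable, handing the same cached instance to several callers is safe.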

JPKab 1962 days ago

Thank you thank you thank you!

I was looking at PyO3 a few months ago, after discovering the orjson python (with rust inside) library and radically speeding up an auto-ML app for work.

I really enjoyed starting to learn Rust, but found the process to embed in Python to be rather intimidating. Looking forward to using your repo as a reference, and love the dtparse work you've done.

dmw_ng 1961 days ago

Another cheap trick if the time column is sequential is to split the string into date and time components, cache the date part and calculate the time part just with some multiplication

Major caveat is timezone handling, but this only applies in a subset of situations
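One way this trick could look, as a sketch only: it assumes naive timestamps in a fixed-width `YYYY-MM-DD HH:MM:SS` layout, and the class name is made up. The date part is re-parsed only when it changes, and the time part is turned into seconds with plain multiplication.

```python
from datetime import datetime, timedelta


class CachedDateParser:
    """Parse 'YYYY-MM-DD HH:MM:SS', re-parsing the date only when it changes."""

    def __init__(self):
        self._date_str = None
        self._midnight = None

    def parse(self, s: str) -> datetime:
        date_part = s[:10]
        if date_part != self._date_str:
            # Slow path: runs once per distinct consecutive date.
            self._midnight = datetime.strptime(date_part, "%Y-%m-%d")
            self._date_str = date_part
        # Fast path: time of day as an offset in seconds from midnight.
        seconds = int(s[11:13]) * 3600 + int(s[14:16]) * 60 + int(s[17:19])
        return self._midnight + timedelta(seconds=seconds)
```

As noted above, this ignores timezones entirely, so it is only valid when every value is in the same fixed offset.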

quietbritishjim 1961 days ago

If you've got to the point of modifying the storage format, then you might as well just use an integer (microseconds since the epoch) and be done with it. That seems cleaner than using a string (or two strings) anyway.
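For illustration, a small sketch of the integer representation, assuming timezone-aware UTC datetimes and the Unix epoch; the helper names are hypothetical. Integer division of timedeltas keeps the conversion exact, avoiding the float rounding that `timestamp()` can introduce.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MICROSECOND = timedelta(microseconds=1)


def to_micros(dt: datetime) -> int:
    # dt must be timezone-aware; the result is an exact integer.
    return (dt - EPOCH) // ONE_MICROSECOND


def from_micros(us: int) -> datetime:
    return EPOCH + timedelta(microseconds=us)
```

Reading such a column back needs no string parsing at all, only one addition per value.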