Hacker News new | ask | show | jobs
by datanecdote 2126 days ago
I am in similar boat. Python centric data scientist. Very tempted to try to learn Rust so I can accelerate certain ETL tasks.

Question for Rust experts: On what ETL tasks would you expect Rust to outperform Numpy, Numba, and Cython? What are the characteristics of a workload that sees order-of-magnitude speed ups from switching to Rust?

3 comments

I'm far from an expert, but I would not expect hand-written Rust code to outperform Numpy. Not because it's Rust and Numpy is written in C, but because Numpy has been deeply optimized over many years by many people and your custom code would not have been. When it comes to performance Rust is generally comparable to C++, as a baseline. It's not going to give you some dramatic advantage that offsets the code-maturity factor.

Now, if you're doing lots of computation in Python itself - not within the confines of Numpy - that's where you might see a significant speed boost. Again, I don't know precisely how Rust and Cython would compare, but I would very much expect Rust to be significantly faster, just as I would very much expect C++ to be significantly faster.

I deal with a lot of ragged data that is hard to vectorize, and currently write cython kernels when the inner loops take too long. Sounds like Rust might be faster than cython? Thanks for the feedback.
Also it might take 20x less RAM compared to using Python objects like sets and dicts. In Rust there's no garbage collection, and you can lay out memory by hand exactly as you want.
Most likely, yes
Julia might be a better fit for this use case.

That way you leverage a more developed data ecosystem, can call python when necessary and avoid writing low level code.

Depends on the task of course.

I’m fascinated by Julia and have test driven it before but it didn’t click for me. Maybe I was doing it wrong and/or the ecosystem has matured since I last looked.

I guess I generally do like the pythonic paradigm of an interpreted glue language orchestrating precompiled functions written in other languages. I don’t need or want to compile the entire pipeline end to end after every edit, that slows my development iteration cycle times.

I just want to write my own fast compiled functions to insert into the pipeline on the rare occasions I need something bespoke that doesn’t already exist in the extended python ecosystem. It seems like a lower level language would be optimal for that?

If the dev cycle feels slow in julia, you can make it snappier with a tool like Revise.jl, it is quite handy.

If you just need to fill a small and slow gap maybe something like numba is also a good option to stay within python.

Going all the way to a low level language would require the compilation, the glue code and expertise in both languages. Probably that slows down the development pipeline more than the JIT compilation from julia or numba.

Anyway, any opportunity to learn/practice some rust is also great!

One thing that may help with the glue-code aspect would be a crate like pyo3[0], which can generate a lot of the details for you.

[0] https://crates.io/crates/pyo3

column-wide map-reduce over large dataframes usually give you a 1000x or so speedup.

With rust you can stream each record and leverage the insane parallelism and async-io libs (rayon, crossbeam, tokio) and a very small memory footprint. sure you have asyncio in python but that’s nowhere near the speed of tokio.

Thanks for the pointers, those crates seem great. The flaky multithreading libs are my least favorite part of python, and rust’s strength in this area seems very appealing.