Hacker News (HN Mirror)

by ritchie46 1260 days ago

Hi Ian ;),

It depends on what you let determine the order. On hiring experience and available content, I wholeheartedly agree with your list.

But if we order by performance/memory efficiency, a single-threaded (eager) library is simply no comparison and should not top that list. In every TPCH query we ran, polars is orders of magnitude faster than pandas.

https://www.pola.rs/benchmarks.html

Interoperability with legacy systems should not be a concern. Polars is backed by Arrow memory and Arrow is becoming the default data transformation layer. Other than that, you can easily convert to pandas or numpy. The cost of that single copy is often negligible compared with the time lost in a pandas join. Polars and pandas can work hand in hand; you don't have to fully replace one.

It is 2023; polars is used in production and is here to stay. IMO it should seriously be considered if performance and consistency are important to you.

3 comments

IanOzsvald 1259 days ago

Hey Ritchie. Re legacy, I'm thinking about wider teams in large organisations (e.g. SWEng system support teams) and IT mandating library upgrade frequency: switching to new libraries can have widespread impacts and the cost can be high. Polars (and Vaex) are definitely here to stay, but I think integration into existing teams may take a while. I followed the PRs around numpy data sharing but I wasn't sure of the end result. Is the data sharing copy-free (always?)? I wasn't sure what the impact was if Rust and NumPy are utilising the same bytes (or even if that was possible). Can you share some detail? Edit: reading the updated thread I see your reply https://news.ycombinator.com/item?id=34298023 which says "1D often no copy"; can you add any colour on when a 1D no copy can't happen and whether 2D no copy is an option?


fbdab103 1260 days ago

>But if we order by performance/memory efficiency

Right there is the disagreement. Like many (most?) people, all of my data munging is in small/medium data where 10 million+ rows is rare. A multiple of pandas performance will not be noticed for the majority of my operations.

Transitioning to a new api on performance alone is not enough to sway me. After all, I write in Python ;). If I were concerned about better throughput, my first alternative would be Dask - it should give better local performance, but could theoretically scale to enormous data without any code changes.


lr1970 1260 days ago

> In every TPCH query we ran, polars is orders of magnitudes faster than pandas.

I have no doubt that polars is faster than pandas. But the published TPCH results [0] are fairly outdated: they are based on polars-0.13.51, while the current polars is 0.15.13. Are there any plans to refresh the benchmarks?

[0] https://www.pola.rs/benchmarks.html
