| HN Mirror


by IanOzsvald 1261 days ago

I'd argue a little differently. I'm co-author of O'Reilly's High Performance Python book and I've been teaching a course around this for years, often to quants.

1. Pandas if you stay in RAM and the team and org already know it, but learn about the reduced-RAM types (e.g. float32 rather than float64, categorical for strings and datetimes if low cardinality, the new Arrow strings in place of the default object str). Pandas 1.5 has an experimental copy-on-write option for more predictable (but probably still not "predictable") memory usage. Try to use a team-agreed subset of functions (e.g. merge over join), since the varied defaults will confuse colleagues (e.g. inner vs left, and other differences). Buying more RAM is normally a cheap (if inelegant) fix.

2. Dask, as it is an easy transition from Pandas (and it scales NumPy math, arbitrary Python non-math functions and lots more), with lots of cloud scaling options too. It stays within the Python ecosystem for reduced cognitive load. It is probably less resource-efficient than Vaex/Polars.

3. Ignore Dask and stick with Spark if your team already uses it, as it'll scale to larger workloads and you've taken the cognitive and engineering hit (pragmatism over purity)
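To make the reduced-RAM advice in item 1 concrete, here is a minimal sketch using synthetic data. The column names and sizes are made up for illustration, and the savings you actually see will depend on your data and pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic data; "price" and "ticker" are hypothetical column names.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "price": rng.normal(100.0, 5.0, size=n),                 # float64 by default
    "ticker": rng.choice(["AAPL", "MSFT", "GOOG"], size=n),  # low-cardinality strings
})

# deep=True counts the bytes held by the string values, not just the pointers.
before = df.memory_usage(deep=True).sum()

# Downcast the floats and store the repeated strings as a categorical.
slim = df.astype({"price": "float32", "ticker": "category"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Note that float32 only carries about seven significant decimal digits, so it is worth checking that the reduced precision is acceptable before downcasting.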

Vaex and Polars are definitely interesting (hi Ritchie!), and great if you're doing research, are comfortable with potentially changing APIs and have no legacy systems to worry about. You might buy yourself a lot of future manoeuvring room. You'll find fewer clues to tricky problems on Stack Overflow than for Pandas, and have a harder time hiring experienced help.
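Going back to the merge-versus-join point in item 1, a small sketch of how the differing defaults can surprise people. The frames and column names here are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# DataFrame.merge defaults to how="inner" and matches on columns.
merged = left.merge(right, on="key")

# DataFrame.join defaults to how="left" and matches on the other frame's index.
joined = left.set_index("key").join(right.set_index("key"))

print(len(merged))  # 2 (only "b" and "c" appear in both frames)
print(len(joined))  # 3 ("a" is kept, with y set to NaN)

# One possible team convention: always use merge and always spell out `how`.
explicit = left.merge(right, on="key", how="left")
```

Spelling out `how` every time costs a few characters and removes the need to remember which method defaults to which join type.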


ritchie46 1261 days ago

Hi Ian ;),

It depends on what you let determine the order. On hiring experience and available content, I wholeheartedly agree with your list.

But if we order by performance/memory efficiency, a single-threaded (eager) library simply bears no comparison and should not top that list. In every TPCH query we ran, polars is orders of magnitude faster than pandas.

https://www.pola.rs/benchmarks.html

Interoperability with legacy systems should not be a concern. Polars is backed by Arrow memory, and Arrow is becoming the default data transformation layer. Other than that, you can easily convert to pandas or numpy. The cost of that single copy is often negligible compared with the time lost in a pandas join. Polars and pandas can work hand in hand; you don't have to fully replace one.

It is 2023, polars is used in production and is here to stay. IMO it should seriously be considered if performance and consistency are important to you.


IanOzsvald 1260 days ago

Hey Ritchie. Re legacy, I'm thinking about wider teams in large organisations (e.g. SWEng system support teams) and IT mandating library upgrade frequency: switching to new libraries can have widespread impacts and the cost can be high. Polars (and Vaex) are definitely here to stay, but I think integration into existing teams may take a while. I followed the PRs around numpy data sharing but I wasn't sure about the end result. Is the data sharing copy-free (always?)? I wasn't sure what the impact was if Rust and NumPy are utilising the same bytes (or even if that was possible). Can you share some detail? Edit: reading the updated thread I see your reply https://news.ycombinator.com/item?id=34298023 which says "1D often no copy". Can you add any colour on when a 1D no-copy can't happen, and whether 2D no-copy is an option?


fbdab103 1261 days ago

>But if we order by performance/memory efficiency

Right there is the disagreement. Like many (most?) people, all of my data munging is in small/medium data where 10 million+ rows is rare. A multiple of pandas performance will not be noticed for the majority of my operations.

Performance alone is not enough to sway me into transitioning to a new API. After all, I write in Python ;). If I were concerned about better throughput, my first alternative would be Dask: it should give better local performance, and could theoretically scale to enormous data without any code changes.


lr1970 1261 days ago

> In every TPCH query we ran, polars is orders of magnitude faster than pandas.

I have no doubt that polars is faster than pandas. But the published TPCH results [0] are fairly outdated: they are based on polars 0.13.51, while the current polars is 0.15.13. Are there any plans to refresh the benchmarks?

[0] https://www.pola.rs/benchmarks.html
