Hacker News new | ask | show | jobs
by antman 1260 days ago
If I understand correctly the currently promoted libraries for dataframes are:

1. Polars if data fits in ram

2. Vaex if data do not fit in ram

3. Spark with the dataframe api (koalas) if data do not fit in a computer

Polars is great and delivers as promised

4 comments

I'd argue a little differently. I'm co-author of O'Reilly's High Performance Python book and I've been teaching a course around this for years, often to quants.

1. Pandas if you stay in RAM, if the team and org already know this, but learn about reduced-ram types (eg float32 rather than float64, categorical for strings and dt if low cardinality, new Arrow strings in place of default Object str). Pandas 1.5 has an experimental copy-on-write option for more predictable (but probably still not "predictable") memory usage, try to use a subset of team-agreed functions (eg merge over join) due to varied defaults that'll confuse colleagues (eg inner Vs left and other differences). Buying more ram is normally a cheap (if inelegant) fix.

2. Dask as it is an easy transition from Pandas (and it scales numpy math, arbitrary python non-math functions and lots more), lots of cloud scaling options too. Stays within Python ecosystem for reduced cognitive load. It is probably less resource efficient than Vaex/Polars

3. Ignore Dask and stick with Spark if your team already uses it, as it'll scale to larger workloads and you've taken the cognitive and engineering hit (pragmatism over purity)

Vaex and Polars are definitely interesting (hi Ritchie!), and great if you're doing research and are comfortable with potentially changing APIs but you have no legacy systems to worry about. You might buy yourself a lot of future manoeuvring room. You'll find fewer clues to tricky problems in SO than for Pandas, and have a harder time hiring experienced help.

Hi Ian ;),

It depends on what let determine the order. Hiring experience and available content, I wholeheartedly agree with your list.

But if we order by performance/memory efficiency, A single threaded, (eager), library simply will be no comparison and should not top that list. In every TPCH query we ran, polars is orders of magnitudes faster than pandas.

https://www.pola.rs/benchmarks.html

Interopability with legacy systems should not be a concern. Polars is backed by arrow memory and arrow is becoming the default data transformation layer. Other than that, you can easily convert to pandas or numpy. That single copy is often no comparison with the time lost in a pandas join. Polars and pandas can work hand in hand, you don't have to fully replace one.

It is 2023, polars is used in production and is here to stay. IMO it should seriously be considered if performance and consistency is important to you.

Hey Ritchie. Re legacy I'm thinking about wider teams in large organisations (eg SWEng system support teams) and IT mandating library upgrade frequency - switching to new libraries can have widespread impacts and the cost can be high. Polars (and Vaex) are definitely here to stay, but I think integration to existing teams may take a while. I followed the PRs around numpy data sharing but I wasn't sure on the end result. Is the data sharing copy-free (always?)? I wasn't sure what the impact was if Rust and NumPy are utilising the same bytes (or even if that was possible). Can you share some detail? Edit - reading the updated thread I your reply https://news.ycombinator.com/item?id=34298023 which says "1D often no copy", can you add any colour to when a 1D no copy can't happen and whether 2D no copy is an option?
>But if we order by performance/memory efficiency

Right there is the disagreement. Like many (most?) people, all of my data munging is in small/medium data where 10 million+ rows is rare. A multiple of pandas performance will not be noticed for the majority of my operations.

Transitioning to a new api on performance alone is not enough to sway me. After all, I write in Python ;). If I were concerned about better throughput, my first alternative would be Dask - it should give better local performance, but could theoretically scale to enormous data without any code changes.

> In every TPCH query we ran, polars is orders of magnitudes faster than pandas.

I have no doubts that polars is faster than pandas. But the published TPCH results [0] are fairly outdated based on polars-0.13.51 while the current polars is 0.15.13. Are there any plans to refresh the benchmarks?

[0] https://www.pola.rs/benchmarks.html

I actually thought polars’ lazy api would allow for out-of-ram computation?

Also dask is more flexible than spark, since it lets you deal with numpy arrays and arbitrary objects better than spark can.

It does. Though the functionality is quite new, we will extend this.

Calling `collect(streaming=True)` on a `LazyFrame` will allow you to process datasets that don't fit into memory. This currently works for groupbys, joins, many functions, filter etc.

We will extend this to sorts and likely other operations as well.

I'm curious if you could use this not for data science tasks but for data engineering tasks - say read a csv or pull a table from oracle and store it as delta lake table or something.

I know its a boring use case, but the challenge with it is that it is a complete waste of money and carbon footprint to use Spark to process a 20 MB CSV or table with few thousand records, but tools like Pandas fall apart when you hit a 50 GB CSV or table with few billion records.

Something more efficient (say, in Rust and not Python or Java) and yet scalable (due to not fitting everything into memory) would be a great help here.

This is exactly what we are aiming for. There are already a lot of queries that can be processed with 100s GBS of data on my 16GB laptop.

And we will extend functionality for out of core processing. A single node can do a lot!

Can you please put an example about dask being more flexible than spark?
This https://docs.dask.org/en/stable/spark.html notes "However, Dask is able to easily represent far more complex algorithms and expose the creation of these algorithms to normal users [compared to spark]" linking to: http://matthewrocklin.com/blog/work/2015/06/26/Complex-Graph...
Yes, say you’ve got imaging data in 3 (or higher) dimensions as numpy arrays and want to run some sort of algorithm on multiple cores/machines. Could be both for data analytics and simulations.

dask.bag has generic parallel processing capabilities. Query a database, a rest api, something. Then merge into dataframes across dask workers.

My understanding is polars will stream if data does not fit in RAM.

Between Polars and Spark Dataframe APIs (not Koalas) as well as the occasional dplyr, I will gladly abandon Pandas.

imo this is wrong in a few ways. firstly your data fits in a computer. you can get a computer with a petabyte of storage if you need to, and spark is slow enough that doing it in a single computer will probably be faster. also while a computer with 1pb of storage is expensive, it's less expensive than splitting your data up in terms of hardware, maintenance, and software dev time costs.

secondly, your data probably fits in RAM if you actually try. your can get a computer with 60TB of RAM which is an awful lot of data.

But does the data fit in compute? Assuming computation needs to be done on the data you could quickly run out of compute power on a single machine. Especially since cost of CPU power is super linear with some pretty hard limits.
You can get 256 cores of Zen4 (with 8 GPUs) in a box. That's a lot of compute. You definitely can be compute constrained but if you are writing efficient code you can do a lot.
That's $44K in CPUs alone, and these clock slower than desktop cores so it's not exactly equivalent. There are many tasks for which GPUs are not relevant. I have an expensive compute limited task and for my purposes a mini cluster of desktop CPUs still seems to be the way to go.
Lets say that the cluster of desktop CPUs is 10 CPUs with 8 cores each for 80 cores at 4ghz in total. Instead you can get a server with 64 cores at 2.5 GHz and 1TB of memory for $50k from Dell (adding another 64 cores would be about $6000 more). While this may sound like a lot less performance, you won't be wasting a ton of performance reading data from disk and communicating over a network All of your cores share the same RAM and cache so your computation will be a ton faster. Furthermore, your cores will all be running the same instruction set so you can take more advantage of vectorization). If you aren't using compute frequently enough to justify a server like this, you can get that level of performance and memory from AWS for $4 an hour.
My task has next to no network or hdd. Data parallel and the task is mostly branches, so GPUs are not effective either. And why use 8 core desktops when 16 core desktops exist. 10 desktops at $1.5K each is way cheaper. It’s also the kind of data that can go into the cloud. Not saying it’s the typical workload but it is my workload. Also it’s CPU bound so they go 100% all the time.
And if your workload is FP and nicely parallelizes and fits in 12 GB or less than you might be able to get that kind of performance for the price of a high end GPU + a box to stick it in. GPUs are insanely cost effective for problems that match their geometry. So much so that it can be worth thinking about it for a little while to see if you can make it fit.