| If I understand correctly the currently promoted libraries for dataframes are: 1. Polars if data fits in ram 2. Vaex if data do not fit in ram 3. Spark with the dataframe api (koalas) if data do not fit in a computer Polars is great and delivers as promised |
1. Pandas if you stay in RAM and the team and org already know it, but learn about the reduced-RAM types (e.g. float32 rather than float64, categoricals for low-cardinality strings and datetimes, the new Arrow strings in place of the default object-dtype strings). Pandas 1.5 has an experimental copy-on-write option for more predictable (but probably still not "predictable") memory usage. Try to use a team-agreed subset of functions (e.g. merge over join), since the varied defaults will confuse colleagues (e.g. inner vs left, among other differences). Buying more RAM is normally a cheap (if inelegant) fix.
2. Dask, as it is an easy transition from Pandas (it also scales NumPy math, arbitrary non-math Python functions and lots more), with lots of cloud scaling options too. It stays within the Python ecosystem for reduced cognitive load. It is probably less resource-efficient than Vaex/Polars.
3. Ignore Dask and stick with Spark if your team already uses it, as it'll scale to larger workloads and you've already taken the cognitive and engineering hit (pragmatism over purity).
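To make point 1 a bit more concrete, here's a rough sketch of the reduced-RAM dtypes and of the merge/join default mismatch. The column names and data are made up for illustration, and the exact savings will depend on your data and pandas version, so treat it as a starting point rather than a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names, values and sizes are invented for this sketch.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "price": rng.random(n),                                # float64 by default
    "city": rng.choice(["London", "Paris", "Berlin"], n),  # plain strings, low cardinality
})

before = df.memory_usage(deep=True).sum()

# float32 halves the numeric column; category stores small integer codes
# plus one copy of each distinct string.
slim = df.astype({"price": "float32", "city": "category"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")

# merge and join have different default `how` values, which is one reason
# to agree on a single function and always spell out `how`.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

merged = left.merge(right, on="key")                         # default how="inner"
joined = left.set_index("key").join(right.set_index("key"))  # default how="left"
explicit = left.merge(right, on="key", how="left")           # intent stated explicitly

print(len(merged), len(joined), len(explicit))
```

Note that float32 trades precision for memory, so check that's acceptable for your columns. The Arrow-backed strings need pyarrow installed, which is why they're left out of the sketch.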
Vaex and Polars are definitely interesting (hi Ritchie!), and great if you're doing research, are comfortable with potentially changing APIs and have no legacy systems to worry about. You might buy yourself a lot of future manoeuvring room. You'll find fewer clues to tricky problems on Stack Overflow than for Pandas, and you'll have a harder time hiring experienced help.