Hacker News
by sweezyjeezy 1612 days ago
I would love to switch to something else, but it feels like pandas is the lingua franca of data science now; switching puts a burden on everyone else.
3 comments

You can use DuckDB as a processing engine on top of Pandas [1], while continuing to use Pandas as a data storage/data interchange format.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

That's what I do at $dayjob whenever I have to do windowing &c. Figuring out this stuff in Pandas is a waste of time. Before I discovered DuckDB, I would re-learn the API every damn time. I came up with a little utility function, which you can implement yourself :)

```
def sqldf(df: DataFrame, query: str) -> DataFrame: ...
```

Years of unpicking others' use of R's sqldf (which by default used to copy the entire data frame to a SQLite db, run the query, then copy the result set back) when they complained their R code was too slow has taught me a visceral, negative reaction to the name and pattern.

Glad to see DuckDB delivering, finally, on the promise of running SQL against in-memory dataframes.

TIL there's an actual 'botched' library with the same name; I actually came up with it independently on a lazy office afternoon :^)
I like interface-only packages in the Julia ecosystem. For example, Tables.jl enables the development of several packages for querying tabular data that work across many concrete implementations, and Plots.jl separates the high-level plotting interface from the plotting backend.
Hah - I'm saying switching libraries is a headache - switching languages is absolutely not an option...
It's true. I've spent a small but nontrivial amount of time learning and using Polars, but it's just a nonstarter for most work projects. Not only does no one else know it exists, let alone how to use it, but it doesn't integrate with (to my knowledge) any ETL or ML Python library. You have to convert to pandas or NumPy, which is costly and to some extent defeats the purpose.
It says here that Polars has zero-copy conversion for Arrow and NumPy: https://github.com/pola-rs/polars/issues/580#issuecomment-82...
The conversion to NumPy is free if you don't have missing data, which is the case most of the time when you send it over to an ML library.

Even if it's not zero copy, it's still not a big deal. Pandas makes a lot more copies internally. I truly wouldn't worry about that single copy if you get an order-of-magnitude speedup overall.

I stand corrected. The conversion felt relatively slow to me, but it was a large dataset and there were definitely missing values. Overall the benefits to speed and API cleanliness might be worth it, though it feels a bit gross to convert Spark to pandas to Polars to NumPy to DMatrix.

That said, it’s so much better than pandas for data manip that I’ll probably still try to use it.

Are you the author? If so, thanks for being so responsive on GitHub. You fixed basically every issue I had almost immediately back when I was learning Polars. It was awesome.

Yep, that's me. Glad to help. :) There's still room for parallelization when converting to a matrix. I will take a look. I haven't put any effort into that conversion yet because it's often a one-time conversion at the end of a pipeline.

But I will improve it. ;)