Hacker News
by mgradowski 1612 days ago
DuckDB and Polars are my bets in the Python data-wrangling space. I grew tired of Pandas' weird-ass API.
5 comments

I would love to switch to something else, but it feels like pandas is the lingua franca of data science now; switching puts a burden on everyone else.
You can use DuckDB as a processing engine on top of Pandas [1], while continuing to use Pandas as a data storage/data interchange format.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

That's what I do at $dayjob whenever I have to do windowing &c. Figuring out this stuff in Pandas is a waste of time. Before I discovered DuckDB, I would re-learn the API every damn time. I came up with a little utility function, which you can implement yourself :)

```
def sqldf(df: DataFrame, query: str) -> DataFrame: ...
```

Years of unpicking others' use of R's sqldf (which by default used to copy the entire data frame to a SQLite db, run the query, then copy the result set back) when they complained their R code was too slow have taught me a visceral, negative reaction to the name and pattern.

Glad to see DuckDB delivering, finally, on the promise of running SQL against in-memory dataframes.

TIL there's an actual 'botched' library with the same name; I actually came up with it independently on a lazy office afternoon :^)
I like interface-only packages in the Julia ecosystem, e.g. Tables.jl enables the development of several packages for querying tabular data that work across many concrete implementations; Plots.jl separates the high-level plotting interface from the plotting backend.
Hah - I'm saying switching libraries is a headache - switching languages is absolutely not an option...
It's true. I've spent a small but nontrivial amount of time learning and using Polars, but it's just a nonstarter for most work projects. Not only does no one else know it exists, let alone how to use it, but it doesn't integrate with (to my knowledge) any ETL or ML Python library. You have to convert to pandas or NumPy, which is costly and to some extent defeats the purpose.
It says here that Polars has zero copy for Arrow and NumPy: https://github.com/pola-rs/polars/issues/580#issuecomment-82...
The to-NumPy conversion is free if you don't have missing data, which is most cases if you're sending it over to an ML library.

If it's not zero copy, it is still not a big deal. Pandas makes a lot more copies internally. I truly wouldn't worry about that single copy if you have an order-of-magnitude speedup overall.

I stand corrected. The conversion felt relatively slow to me, but it was a large dataset and there were definitely missing values. Overall the benefits to speed and API cleanliness might be worth it, though it feels a bit gross to convert Spark to pandas to Polars to NumPy to DMatrix.

That said, it’s so much better than pandas for data manip that I’ll probably still try to use it.

Are you the author? If so, thanks for being so responsive on GitHub. You fixed basically every issue I had almost immediately back when I was learning Polars. It was awesome.

Yep, that's me. Glad to help. :) There's still room for parallelization when converting to a matrix. I will take a look. I haven't given that conversion any effort yet because it's often a one-time conversion at the end of a pipeline.

But I will improve it. ;)

Yes I used it for the first time in ages recently and I have to say I found the whole thing a mess. There are about 5 ways to do everything.
I don't know DuckDB but polars could dethrone pandas. We're planning on using it to create our pipeline. Ibis-project is another solution if anyone wants to check it out.
Huh, even though I would prefer a universal SQL layer, ibis looks quite nice.
I mean I haven't heard about DuckDB.
I haven't touched pandas in months, but I also found it quite tiring to deal with.

Does your setup allow for an end-to-end solution? I mean, can I sink time into that setup and feel like I have everything I need to for regular data-wrangling?

I'm sure Pandas is amazing, but as a newbie I found myself doing a lot of transformation logic with plain Python data structures because it's just so much easier.

Maybe I'm dumb but going around the docs sometimes was like :/

Author of the post and siuba here. I'm pretty interested in exploring supporting polars as a backend and, if it works well, supporting versions of the SQL backends that translate to SQL based on the polars method API :).

(I haven't really used it, but it looks promising)

Hey, I love siuba. Haven't had a chance to use it much but it scratches an itch for me. For years I've grumbled about how Python isn't flexible enough to accommodate tidyverse style libraries, as it lacks pipes and lazy evaluation (or macros), but siuba has managed to be very nice to use.

Maybe someday Python'll get a macro system ...