| HN Mirror


	by mgradowski 1612 days ago
	DuckDB and Polars are my bets in the Python data-wrangling space. I grew tired of Pandas' weird-ass API.

5 comments

sweezyjeezy 1612 days ago

I would love to switch to something else, but it feels like pandas is the lingua franca of data science now; switching puts a burden on everyone else.

mytherin 1612 days ago

You can use DuckDB as a processing engine on top of Pandas [1], while continuing to use Pandas as a data storage/data interchange format.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

mgradowski 1612 days ago

That's what I do at $dayjob whenever I have to do windowing &c. Figuring out this stuff in Pandas is a waste of time. Before I discovered DuckDB, I would re-learn the API every damn time. I came up with a little utility function, which you can implement yourself :)

```
def sqldf(df: DataFrame, query: str) -> DataFrame: ...
```

kristjansson 1612 days ago

Years of unpicking others' use of R's sqldf (which by default used to copy the entire data frame to a SQLite db, run the query, then copy the result set back) when they complained their R code was too slow have taught me a visceral, negative reaction to the name and pattern.

Glad to see DuckDB delivering, finally, on the promise of running SQL against in-memory dataframes.
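The copy-in, query, copy-out pattern described above can be sketched with Python's standard-library `sqlite3`. This illustrates the general pattern only, not R's sqldf itself; the function name `sqldf_via_sqlite` and the list-of-dicts table representation are assumptions made for the example, and it expects at least one input row.

```python
import sqlite3


def sqldf_via_sqlite(rows: list[dict], table: str, query: str) -> list[dict]:
    """Copy `rows` into a temporary SQLite table, run `query`, copy results back."""
    columns = list(rows[0])
    con = sqlite3.connect(":memory:")
    try:
        col_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        con.execute(f'CREATE TABLE "{table}" ({col_list})')
        # Copy 1: every input row is written into the SQLite database.
        con.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r[c] for c in columns) for r in rows],
        )
        cur = con.execute(query)
        names = [d[0] for d in cur.description]
        # Copy 2: every result row is materialized back into Python objects.
        return [dict(zip(names, rec)) for rec in cur.fetchall()]
    finally:
        con.close()


data = [{"name": "x", "value": 1}, {"name": "y", "value": 2}, {"name": "x", "value": 3}]
totals = sqldf_via_sqlite(
    data, "t", "SELECT name, SUM(value) AS total FROM t GROUP BY name ORDER BY name"
)
```

Both copies grow with the size of the data, which is the cost the comment complains about.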

mgradowski 1612 days ago

TIL there's an actual 'botched' library with the same name; I actually came up with it independently on a lazy office afternoon :^)

mgradowski 1612 days ago

I like interface-only packages in the Julia ecosystem. For example, Tables.jl enables the development of several packages for querying tabular data that work across many concrete implementations, and Plots.jl separates the high-level plotting interface from the plotting backend.
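A rough Python analogue of that interface-only idea, using `typing.Protocol` from the standard library. The `Table` protocol, the two toy implementations and `column_sum` are all hypothetical names invented for this sketch; the actual Tables.jl interface is different.

```python
from typing import Iterator, Protocol


class Table(Protocol):
    """Minimal tabular interface: column names plus row iteration."""

    def column_names(self) -> list[str]: ...
    def rows(self) -> Iterator[dict]: ...


class RowTable:
    """Stores data as a list of row dicts."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def column_names(self) -> list[str]:
        return list(self._records[0]) if self._records else []

    def rows(self) -> Iterator[dict]:
        return iter(self._records)


class ColumnTable:
    """Stores data as a dict of equal-length columns."""

    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def column_names(self) -> list[str]:
        return list(self._columns)

    def rows(self) -> Iterator[dict]:
        names = self.column_names()
        for values in zip(*(self._columns[n] for n in names)):
            yield dict(zip(names, values))


def column_sum(table: Table, name: str) -> float:
    """Works with any object that satisfies the Table protocol."""
    return sum(row[name] for row in table.rows())


by_rows = RowTable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
by_cols = ColumnTable({"a": [1, 3], "b": [2, 4]})
```

`column_sum` never inspects how the data is stored, so a query written once works with either storage layout.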

sweezyjeezy 1612 days ago

Hah - I'm saying switching libraries is a headache - switching languages is absolutely not an option...

mrtranscendence 1612 days ago

It's true. I've spent a small but nontrivial amount of time learning and using Polars, but it's just a nonstarter for most work projects. Not only does no one else know it exists, let alone how to use it, but it doesn't integrate with (to my knowledge) any ETL or ML Python library. You have to convert to pandas or NumPy, which is costly and to some extent defeats the purpose.

elforce002 1612 days ago

It says here: https://github.com/pola-rs/polars/issues/580#issuecomment-82... that Polars has zero copy for Arrow and NumPy.

ritchie46 1611 days ago

The to-NumPy conversion is free if you don't have missing data, which is most of the cases if you send it over to an ML library.

If it's not zero copy, it is still not a big deal. Pandas makes a lot more copies internally. I truly wouldn't worry about that single copy if you have an order of magnitude speedup overall.

mrtranscendence 1611 days ago

I stand corrected. The conversion felt relatively slow to me, but it was a large dataset and there were definitely missing values. Overall the benefits to speed and API cleanliness might be worth it, though it feels a bit gross to convert Spark to pandas to Polars to NumPy to DMatrix.

That said, it’s so much better than pandas for data manip that I’ll probably still try to use it.

Are you the author? If so, thanks for being so responsive on GitHub. You fixed basically every issue I had almost immediately back when I was learning Polars. It was awesome.

ritchie46 1611 days ago

Yep, that's me. Glad to help. :) There's still room for parallelization when converting to a matrix. I will take a look. I haven't given that conversion any effort yet because it's often a one-time conversion at the end of a pipeline.

But I will improve it. ;)

anonymousDan 1612 days ago

Yes I used it for the first time in ages recently and I have to say I found the whole thing a mess. There are about 5 ways to do everything.

elforce002 1612 days ago

I don't know DuckDB but polars could dethrone pandas. We're planning on using it to create our pipeline. Ibis-project is another solution if anyone wants to check it out.

mgradowski 1612 days ago

Huh, even though I would prefer a universal SQL layer, ibis looks quite nice.

elforce002 1612 days ago

I mean I haven't heard about DuckDB.

spaniard89277 1612 days ago

I haven't touched pandas in months, but I also found it quite tiring to deal with.

Does your setup allow for an end-to-end solution? I mean, can I sink time into that setup and feel like I have everything I need for regular data-wrangling?

I'm sure Pandas is amazing, but as a newbie I found myself doing a lot of transformation logic with Python data structures because it's just so much easier.

Maybe I'm dumb but going around the docs sometimes was like :/

closed 1612 days ago

Author of the post and siuba here. I'm pretty interested in exploring supporting Polars as a backend and, if it works well, supporting versions of the SQL backends that translate to SQL based on the Polars method API :).

(I haven't really used it, but it looks promising)

mrtranscendence 1612 days ago

Hey, I love siuba. Haven't had a chance to use it much but it scratches an itch for me. For years I've grumbled about how Python isn't flexible enough to accommodate tidyverse-style libraries, as it lacks pipes and lazy evaluation (or macros), but siuba has managed to be very nice to use.

Maybe someday Python'll get a macro system ...
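For illustration, a pipe can be emulated today with operator overloading. This is a generic standard-library sketch and not how siuba is implemented; `Pipeable`, `where` and `total` are hypothetical names. Each step stores a function instead of running it, so nothing happens until data is piped in, which gives a limited form of deferred evaluation.

```python
from typing import Any, Callable


class Pipeable:
    """Wraps a one-argument function so that `data >> step` calls `step(data)`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __rrshift__(self, left: Any) -> Any:
        # Python falls back to this when the left operand cannot handle `>>`.
        return self.func(left)


def where(predicate: Callable[[Any], bool]) -> Pipeable:
    """Build a step that keeps the items matching `predicate`."""
    return Pipeable(lambda items: [x for x in items if predicate(x)])


def total() -> Pipeable:
    """Build a step that sums the items."""
    return Pipeable(sum)


result = [1, 2, 3, 4, 5, 6] >> where(lambda x: x % 2 == 0) >> total()
```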
