by scrlk 474 days ago
Ibis also solves this problem by providing a portable dataframe API that works across multiple backends (DuckDB by default): https://ibis-project.org/
5 comments

Disclosure: I am the author of Polars and this post. The difference with Ibis is that Polars Cloud will also manage hardware. It is similar to Modal in that sense. You don't have to have a running cluster to fire a remote query.

The other difference is that we focus only on Polars and honor the Polars semantics and data model. Switching backends via Ibis doesn't honor this, as many architectures have different semantics regarding NaNs, missing data, how those are ordered, decimal arithmetic behavior, regex engines, type upcasting, overflow, etc.

And lastly, we will ensure it works seamlessly with the Polars landscape, which means that Polars plugins and IO plugins will also be first-class citizens.
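To make the point about differing semantics concrete, here is a minimal stdlib-only sketch. It does not model Polars, Ibis, or any real backend: the two sort policies and the two 8-bit addition rules are hypothetical, chosen only to show how the same column and the same expression can give different answers depending on the engine's conventions.

```python
import math


def sort_policy_a(values):
    """Hypothetical engine A: numbers first, then NaN, then missing (None)."""
    def key(v):
        if v is None:
            return (2, 0.0)
        if math.isnan(v):
            return (1, 0.0)
        return (0, v)
    return sorted(values, key=key)


def sort_policy_b(values):
    """Hypothetical engine B: missing (None) first, then NaN, then numbers."""
    def key(v):
        if v is None:
            return (0, 0.0)
        if math.isnan(v):
            return (1, 0.0)
        return (2, v)
    return sorted(values, key=key)


def add_i8_wrapping(x, y):
    """Hypothetical engine A: signed 8-bit addition wraps around on overflow."""
    return ((x + y + 128) % 256) - 128


def add_i8_checked(x, y):
    """Hypothetical engine B: signed 8-bit addition raises on overflow."""
    result = x + y
    if not -128 <= result <= 127:
        raise OverflowError(f"{x} + {y} does not fit in int8")
    return result


# The same column, sorted under the two policies, ends up in a different order.
column = [3.0, None, float("nan"), 1.0]
sorted_a = sort_policy_a(column)
sorted_b = sort_policy_b(column)

# The same expression either wraps silently or fails loudly.
wrapped = add_i8_wrapping(127, 1)
try:
    add_i8_checked(127, 1)
    checked_raised = False
except OverflowError:
    checked_raised = True
```

A query written once and run against both hypothetical engines would return rows in a different order and disagree on the overflowing sum, which is the kind of drift a single-engine data model avoids.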

It’s funny you mention Modal. I use modal to do fan-out processing of large-ish datasets. Right now I store the transient data in duckdb on modal, using polars (and sometimes ibis) as my api of choice.

I did this, rather than use snowflake, because our custom python “user defined functions” that process the data are not deployable on snowflake out of the gate, and the ergonomics of shipping custom code to modal are great, so I’m willing to pay a bit more complexity to ship data to modal in exchange for these great dev ergonomics.

All of that is to say: what does it look like to have custom python code running on my polars cloud in a distributed fashion? Is that a solved problem?

Yes, you can run

`pc.remote(my_udf, schema)`

Where

`def my_udf() -> DataFrame`

We link the appropriate Python version at cluster startup.
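For readers unfamiliar with the fan-out pattern being discussed, the following is a local stand-in only, not the Polars Cloud API. It uses a thread pool from the Python standard library to apply a user-defined function to independent partitions. The `fan_out` helper, the partition layout, and the dict-based rows are all assumptions made for illustration, and this `my_udf` takes a partition argument, unlike the signature quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def my_udf(partition):
    """Hypothetical user-defined function: summarize one partition of rows."""
    return {
        "rows": len(partition),
        "total": sum(row["value"] for row in partition),
    }


def fan_out(udf, partitions, max_workers=4):
    """Apply `udf` to every partition concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(udf, partitions))


# Three made-up partitions, including an empty one.
partitions = [
    [{"value": 1}, {"value": 2}],
    [{"value": 10}],
    [],
]
results = fan_out(my_udf, partitions)
```

In a managed setting the pool would be replaced by remote workers, and the service would have to ship the function and a matching Python environment to them, which is the part the comment above says is handled at cluster startup.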

I've played around a bit with ibis for some internal analytics stuff, and honestly it's pretty nice to have one unified api for duckdb, postgres, etc. saves you from a ton of headaches switching context between different query languages and syntax quirks. but like you said, performance totally depends on the underlying backend, and sometimes that's a mixed bag—duckdb flies, but certain others can get sluggish with more complex joins and aggregations.

polars cloud might have an advantage here since they're optimizing directly around polars' own rust-based engine. i've done a fair bit of work lately using polars locally (huge fan of the lazy api), and if they can translate that speed and ergonomics smoothly into the cloud, it could be a real winner. the downside is obviously potential lock-in, but if it makes my day-to-day data wrangling faster, it might be worth the tradeoff.

curious to see benchmarks soon against dask, ray, and spark for some heavy analytics workloads.

My experience with it is that it's decent, but a "lowest-common-denominator" solution. So you can write a few things agnostically, but once you need to write anything moderately complex, it gets a little annoying to work with. Also, a lot of the backends aren't very performant (perhaps due to the translation/transpilation).
without locking you into a single cloud vendor ;)
wow, ibis supports nearly 20 backends, that's impressive