| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Starlord2048 474 days ago

I can appreciate the pain points you guys are addressing.

The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.

I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.

The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.

Anyway, I would love to apply for the early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.

1 comments

scrlk 474 days ago

Ibis also solves this problem by providing a portable dataframe API that works across multiple backends (DuckDB by default): https://ibis-project.org/

link

ritchie46 474 days ago

Disclosure, I am the author of Polars and this post. The difference with Ibis is that Polars cloud will also manage hardware. It is similar to Modal in that sense. You don't have to have a running cluster to fire a remote query.

The other is that we are only focussing on Polars and honor the Polars semantics and data model. Switching backends via Ibis doesn't honor this, as many architectures have different semantics regarding NaNs, missing data, order of them, decimal arithmetic behavior, regex engines, type upcasting, overflowing, etc.

And lastly, we will ensure it works seamlessly with the Polars landscape, that means that Polars Plugins and IO plugins will also be first class citizens.

link

TheTaytay 473 days ago

It’s funny you mention Modal. I use modal to do fan-out processing of large-ish datasets. Right now I store the transient data in duckdb on modal, using polars (and sometimes ibis) as my api of choice.

I did this, rather than use snowflake, because our custom python “user defined functions” that process the data are not deployable on snowflake out of the gate, and the ergonomics of shipping custom code to modal are great, so I’m willing to pay a bit more complexity to ship data to modal in exchange for these great dev ergonomics.

All of that is to say: what does it look like to have custom python code running on my polars cloud in a distributed fashion? Is that a solved problem?

link

ritchie46 473 days ago

Yes, you can run

`pc.remote(my_udf, schema)`

Where

`def my_udf() -> DataFrame`

We link the appropiate Python version at cluster startup.

link

ZeroTalent 474 days ago

I've played around a bit with ibis for some internal analytics stuff, and honestly it's pretty nice to have one unified api for duckdb, postgres, etc. saves you from a ton of headaches switching context between different query languages and syntax quirks. but like you said, performance totally depends on the underlying backend, and sometimes that's a mixed bag—duckdb flies, but certain others can get sluggish with more complex joins and aggregations.

polars cloud might have an advantage here since they're optimizing directly around polars' own rust-based engine. i've done a fair bit of work lately using polars locally (huge fan of the lazy api), and if they can translate that speed and ergonomics smoothly into the cloud, it could be a real winner. the downside is obviously potential lock-in, but if it makes my day-to-day data wrangling faster, it might be worth the tradeoff.

curious to see benchmarks soon against dask, ray, and spark for some heavy analytics workloads.

link

theLiminator 474 days ago

My experience with it is that it's decent, but a "lowest-common denominator" solution. So you can write a few things agnostically, but once you need to write anything moderately complex, it gets a little annoying to work with. Also a lot of the backends aren't very performant (perhaps due to the translation/transpilation).

link

codydkdc 474 days ago

without locking you into a single cloud vendor ;)

link

Starlord2048 474 days ago

wow, ibis supports nearly 20 backends, that's impressive

link