by theLiminator 677 days ago
Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.

Ideally, you'd support an API similar to Polars (which I have found to be the nicest thus far).

It'd also be important/useful to support Python UDFs (think numpy/jax/etc.).

It'd be very cool if you could collaborate with or even tap into the Polars frontend. If you could execute Polars logical plans but with a streaming source, that would be huge.
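
To make the "same logical plan, batch or streaming source" idea concrete, here is a minimal sketch in plain Python. It is not Polars or DataFusion code: the `Plan` class, its `filter`/`with_column`/`execute` methods, and the sensor data are all hypothetical names made up for illustration. It only covers row-level steps (filters and per-row Python UDFs); aggregations over an unbounded stream would additionally need some form of windowing.

```python
from dataclasses import dataclass, field
from itertools import count, islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Step = Callable[[Iterator[Row]], Iterator[Row]]


@dataclass(frozen=True)
class Plan:
    """Hypothetical lazy plan: an ordered list of row-level steps."""

    steps: List[Step] = field(default_factory=list)

    def filter(self, predicate: Callable[[Row], bool]) -> "Plan":
        # Record the step; nothing runs until execute() is consumed.
        def step(rows: Iterator[Row]) -> Iterator[Row]:
            return (r for r in rows if predicate(r))

        return Plan(self.steps + [step])

    def with_column(self, name: str, udf: Callable[[Row], Any]) -> "Plan":
        # A per-row Python UDF that adds one derived column.
        def step(rows: Iterator[Row]) -> Iterator[Row]:
            return ({**r, name: udf(r)} for r in rows)

        return Plan(self.steps + [step])

    def execute(self, source: Iterable[Row]) -> Iterator[Row]:
        # The plan does not care whether the source is bounded or not.
        rows = iter(source)
        for step in self.steps:
            rows = step(rows)
        return rows


# One plan definition, reused for both sources below.
plan = (
    Plan()
    .filter(lambda r: r["temp_c"] > 20.0)
    .with_column("temp_f", lambda r: r["temp_c"] * 9 / 5 + 32)
)

# Batch source: a finite list, fully materialized.
batch = [{"sensor": "a", "temp_c": 18.5}, {"sensor": "b", "temp_c": 25.0}]
batch_result = list(plan.execute(batch))


# Streaming source: an unbounded generator, consumed incrementally.
def readings() -> Iterator[Row]:
    for i in count():
        yield {"sensor": "s", "temp_c": 15.0 + i * 5}


stream_result = list(islice(plan.execute(readings()), 2))
```

The plan is built once and only pulls rows when the result is iterated, which is why the unbounded generator works: `islice` stops after two output rows instead of waiting for the source to finish.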

2 comments

Have you looked at Databend? They support Flink CDC (https://docs.databend.com/guides/load-data/load-db/flink-cdc), so it should be able to handle hybrid use cases.

I haven't looked at their Python API, but they support PRQL, which is a pretty nice and ergonomic interface in my (biased) opinion. See https://docs.databend.com/sql/sql-reference/ansi-sql#support...

DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python SDK.

> It'd also be important/useful to support Python UDFs (think numpy/jax/etc.).

Yep, that's our long-term game plan.

> It'd be very cool if you could collaborate with or even tap into the Polars frontend. If you could execute Polars logical plans but with a streaming source, that would be huge.

Are there examples of projects that do this? I'd be very much interested in looking into this.

> Are there examples of projects that do this? I'd be very much interested in looking into this.

Nope, I don't believe there are. Unfortunately, the Polars developers don't seem interested in exporting their logical plans to Substrait, so there's no obvious way forward.

> DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python SDK.

Ah, since this is the case, it might also make sense to tap into the DataFusion Python bindings, which recently got a massive overhaul to give them a dev experience closer to Polars (though the docs are still quite a bit behind).

I'm looking forward to seeing what the result will be! I know Ibis is also an option, but from the little I've played around with it, I've found it's just the lowest common denominator and doesn't provide as nice an experience as using Polars directly (or whatever query engine API is provided).