
by chrisjc 546 days ago

Thank you for all the work you guys do. The Arrow ecosystem is just absolutely incredible.

My few gripes about interop with duckdb concern Arrow scanning and pushdowns, and they extend to interop with other projects like pyiceberg too.

Registering an Arrow Dataset (or pyiceberg scan) as a "duckdb relation" (virtual view) is still a little problematic. Querying these "relations" does not always result in an optimal outcome.

For Arrow datasets, you can intercept the duckdb pushdown, but by then duckdb will have already "optimized" the plan to its liking, and any scanning restrictions that would have been more advantageous given the nuances of the dataset might have been lost. E.g.:

    WHERE A IN (3, 5, 7)

is presented to the Arrow scanner (pushdown) as "A is between 3 and 7 inclusive" (https://duckdb.org/docs/guides/performance/indexing.html#zon...).
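
To make the cost concrete, here is a toy sketch in plain Python (no duckdb or Arrow involved) of min/max pruning over some made-up fragments. The `RowGroup` class, the fragment names and the statistics are all hypothetical; the point is only that a scanner that still sees the IN list can skip fragments that the widened range cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical fragment carrying min/max statistics for column A."""
    name: str
    min_a: int
    max_a: int


def prune_by_range(groups, lo, hi):
    """Keep fragments whose [min, max] overlaps the closed range [lo, hi]."""
    return [g for g in groups if g.max_a >= lo and g.min_a <= hi]


def prune_by_in_list(groups, values):
    """Keep fragments whose [min, max] contains at least one listed value."""
    return [g for g in groups if any(g.min_a <= v <= g.max_a for v in values)]


# Made-up statistics, chosen so that some fragments fall between the IN values.
GROUPS = [
    RowGroup("g0", 1, 2),
    RowGroup("g1", 3, 3),
    RowGroup("g2", 4, 4),
    RowGroup("g3", 5, 5),
    RowGroup("g4", 6, 6),
    RowGroup("g5", 7, 9),
]

# What the scanner can do with "A is between 3 and 7 inclusive".
range_hits = [g.name for g in prune_by_range(GROUPS, 3, 7)]
# What it could do if it still saw "A IN (3, 5, 7)".
in_list_hits = [g.name for g in prune_by_in_list(GROUPS, [3, 5, 7])]

print(range_hits)    # ['g1', 'g2', 'g3', 'g4', 'g5']
print(in_list_hits)  # ['g1', 'g3', 'g5']
```

With the range, five of the six fragments must be read; with the original list, only three. How large the gap is in practice depends entirely on how the data is laid out.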

Perhaps in a similar way, turning a pyiceberg scan into a relation for duckdb effectively materializes the entire scan as an Arrow Table, rather than handing duckdb some kind of pushdown/"scan plan" that it could potentially execute more efficiently with its READ_PARQUET() functionality.
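
As a rough illustration of the "scan plan" idea, the sketch below builds a `read_parquet` query from a list of data file paths instead of materializing a table first. The helper `read_parquet_sql` and the paths are hypothetical; in practice the paths would have to come from the Iceberg scan planning step, which is not shown here. This also assumes the planned files can be read as plain Parquet, so it ignores things like Iceberg delete files and schema evolution.

```python
def read_parquet_sql(file_paths, where=None):
    """Build a DuckDB query that scans the given Parquet files directly.

    Hypothetical helper for illustration; it only produces SQL text.
    """
    if not file_paths:
        raise ValueError("need at least one file path")
    # Double any single quotes so each path is a valid SQL string literal.
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in file_paths)
    sql = f"SELECT * FROM read_parquet([{quoted}])"
    if where:
        # The filter stays in the SQL, so duckdb's own Parquet reader
        # gets the chance to push it down.
        sql += f" WHERE {where}"
    return sql


# Made-up paths standing in for the data files selected by a scan.
planned_files = [
    "s3://bucket/tbl/data/f1.parquet",
    "s3://bucket/tbl/data/f2.parquet",
]
query = read_parquet_sql(planned_files, where="A IN (3, 5, 7)")
print(query)
```

The resulting string could then be handed to a duckdb connection, which would leave projection and filter pushdown to duckdb's Parquet reader rather than to a fully loaded Arrow Table.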

Most of this is probably dependent on duckdb development, but all of the incredible interop work done across communities/ecosystems so far gives me a lot of confidence that these will soon be matters of the past.

1 comments

1egg0myegg0 545 days ago

IN list filter predicate pushdown is much improved in DuckDB 1.2, coming in about a week! I am not sure if it applies to Arrow yet or not. Disclaimer: I work at MotherDuck and DuckDB Labs


ianmcook 545 days ago

@1egg0myegg0 that's great to hear. I'll check to see if it applies to Arrow.

Another performance issue with DuckDB/Arrow integration that we've been working to solve is that Arrow lacked a canonical way to pass statistics along with a stream of data. So for example if you're reading Parquet files and passing them to DuckDB, you would lose the ability to pass the Parquet column statistics to DuckDB for things like join order optimization. We recently added an API to Arrow to enable passing statistics, and the DuckDB devs are working to implement this. Discussion at https://github.com/apache/arrow/issues/38837.
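
A toy sketch of why this matters for joins, in plain Python. This is not DuckDB's optimizer and not the Arrow statistics API; `StreamStats`, `choose_build_side` and the row counts are all made up for illustration. It only shows that a planner with row counts can build the hash table on the smaller input, while one without them has to guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamStats:
    """Hypothetical per-stream statistics; row_count is None when absent."""
    name: str
    row_count: Optional[int]


def choose_build_side(left, right):
    """Return (build, probe) for a hash join.

    With row counts on both sides, build on the smaller input.
    Without them, fall back to an arbitrary convention: build on the right.
    """
    if left.row_count is None or right.row_count is None:
        return right, left
    if left.row_count <= right.row_count:
        return left, right
    return right, left


# Statistics available, e.g. carried over from Parquet metadata.
customers = StreamStats("customers", 1_000)
orders = StreamStats("orders", 50_000_000)
build, probe = choose_build_side(customers, orders)
print(build.name, probe.name)  # customers orders

# Same streams with the statistics lost along the way.
blind_build, blind_probe = choose_build_side(
    StreamStats("customers", None), StreamStats("orders", None)
)
print(blind_build.name, blind_probe.name)  # orders customers
```

In the second case the planner ends up hashing the 50-million-row side purely because of where it appears in the query.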
