HN Mirror

by jamesblonde 906 days ago

I disagree with this strongly - "The best way to store Apache Arrow dataframes in files on disk is with Feather. However, it’s also possible to convert to Apache Parquet format and others."

The best way to build your own non-JVM lakehouse is to use Iceberg for metadata, Parquet for the data, and DuckDB for queries over Arrow tables (reading Parquet directly into Arrow is very cheap), and then go Arrow->Pandas or Polars (either directly or via a service with Arrow Flight).

If you put Feather in the mix, the whole Python lakehouse stack doesn't currently work.

1 comment

fbdab103 905 days ago

At one point, I thought Feather did not carry any long-term format guarantees. Presumably that has now changed, but I still feel like Parquet is the best future-proof option on the table.