by treesciencebot 960 days ago

I'd be curious how the performance compares to DataFusion[0], one of the top contenders to DuckDB in this area (although they differ in a lot of ways, I find it the closest of all the alternatives).

ClickBench (from ClickHouse) has some benchmarks[1] where the two can be compared, but I'm not sure how up to date they are. At least a while back they were badly out of date, and I haven't looked too closely at whether they keep it fair for everyone else :)

[0]: https://github.com/apache/arrow-datafusion

[1]: https://benchmark.clickhouse.com

4 comments

jabart 960 days ago

Looks like a recent PR bumped benchmark.clickhouse.com to DuckDB v0.9 on the 3rd.

https://github.com/ClickHouse/ClickBench/pull/141

sanderjd 960 days ago

Looks like DataFusion is included in most of the results in the article?

treesciencebot 960 days ago

You are right! It seems the results are not searchable as text, which is why my Ctrl+F searches failed.

leicmi 960 days ago

A paper on DataFusion is in progress[0].

The draft[1] includes a comparison to DuckDB and preliminary benchmark results.

[0]: https://github.com/apache/arrow-datafusion/issues/6782

[1]: https://www.overleaf.com/read/qjhrxqhgksvr

riku_iki 959 days ago

Why do you run benchmarks on such small datasets? It is very hard to judge performance that way.

slt2021 960 days ago

Question about Arrow: the format doesn't seem very space efficient.

I tried converting one of the Parquet files from my data lake to Arrow, and the size difference is staggering: 20 MB Parquet -> 700 MB Arrow.

It doesn't seem fit for a data lake at all.

ayhanfuat 960 days ago

Arrow is not really designed for storage though. See the "Parquet vs Arrow" section of this post (https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encod...):

> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.

> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
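
The tradeoff in the quoted passage can be seen in a toy model. This is a minimal sketch using only the Python standard library; it is not Arrow's or Parquet's actual layout, and the helper names `fixed_width_get`, `rle_encode`, and `rle_get` are hypothetical. A fixed-width buffer lets you compute the byte offset of any value directly, while a run-length-encoded column is far smaller but has to be walked to find a given position.

```python
import struct

def fixed_width_get(buf: bytes, i: int) -> int:
    """O(1): the i-th int64 lives at byte offset i * 8."""
    return struct.unpack_from("<q", buf, i * 8)[0]

def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_get(runs, i: int) -> int:
    """O(number of runs): walk the runs until position i is covered."""
    for value, length in runs:
        if i < length:
            return value
        i -= length
    raise IndexError("index out of range")

values = [7] * 1000 + [9] * 1000 + [7] * 500
fixed = struct.pack(f"<{len(values)}q", *values)  # 8 bytes per value
runs = rle_encode(values)                         # one entry per run

print(len(fixed), len(runs))
print(fixed_width_get(fixed, 1500), rle_get(runs, 1500))
```

Both lookups return the same value; the difference is that the first is plain arithmetic on an offset and the second is a scan whose cost grows with the number of runs.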

slt2021 960 days ago

The O(1) random access doesn't look like an advantage at all, honestly.

If I load my Parquet file into memory, I will have O(1) random access to any row just as well.

Plus, considering that Arrow recommends working in chunks of 1000 rows per file, I am curious to learn the exact tasks Arrow is optimizing for.

The only use case I can think of is transferring data between systems written in different languages/runtimes with zero serialization/deserialization: just send/receive memory buffers that are nicely mapped to dataframes.
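
For readers unfamiliar with the "zero serialization" idea, here is a minimal sketch using Python's built-in buffer protocol as a stand-in. It is not Arrow's API; it only illustrates what it means for a consumer to read a producer's memory without a copy or a decode step. The variable names are made up for illustration.

```python
from array import array

# Producer side: a column of int64 values in one contiguous buffer.
column = array("q", range(10))

# Consumer side: a memoryview exposes the same memory without copying.
view = memoryview(column)
window = view[2:5]          # slicing a memoryview is also zero-copy

# A write through the original buffer is visible through the view,
# which shows that both names refer to the same memory.
column[3] = 99
shared = window.tolist()

# Slicing the array itself, by contrast, makes an independent copy.
copied = column[2:5]
column[3] = -1

print(shared, list(copied), window.tolist())
```

The view follows every change to the underlying buffer, while the copy stays frozen at the moment it was made. A shared in-memory layout lets systems hand each other views rather than copies.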

throwboatyface 960 days ago

Into what format in memory? You can't just mmap a Parquet file and access whole records; it's a columnar format.

Arrow is absolutely designed for interop between languages. Often devs want to develop their core in something like Python, while the platform devs want to work in Scala. Arrow lets you write all the distributed system and data shuffling in Scala, while users can access the same records in their user-defined code with minimal overhead.
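
To make the "columnar" point concrete, here is a small sketch with the standard library only. The two-field `(id, price)` schema and the helpers `row_get` and `col_get` are hypothetical, and a real Parquet file additionally encodes and compresses each column, which this toy version leaves out.

```python
import struct
from array import array

records = [(1, 2.5), (2, 3.5), (3, 4.5)]  # (id, price) rows

# Row-oriented: a record's fields are adjacent, so one read returns a whole record.
ROW = struct.Struct("<qd")
row_buf = b"".join(ROW.pack(*r) for r in records)

def row_get(i):
    return ROW.unpack_from(row_buf, i * ROW.size)

# Column-oriented: each field lives in its own contiguous buffer.
ids = array("q", (r[0] for r in records))
prices = array("d", (r[1] for r in records))

def col_get(i):
    # A whole record is stitched together from one lookup per column.
    return (ids[i], prices[i])

print(row_get(1), col_get(1), sum(prices))
```

The columnar layout makes per-record access slightly more work, but an aggregate such as `sum(prices)` touches one contiguous buffer and nothing else.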

alamb 960 days ago

The following paper describes some of the tradeoffs between different formats

Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

iwd 960 days ago

Do you have compression enabled? At least from pandas, Parquet defaults to compressed and Arrow/Feather defaults to uncompressed. When I enable zstd compression, I get similar file sizes, and sometimes Arrow is smaller.
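
A rough illustration of how much block compression alone can change the size of a repetitive column. This sketch uses `zlib` from the standard library as a stand-in for zstd, and the column contents are invented for the example. The actual ratio depends entirely on the data, so no specific compressed size is claimed here.

```python
import struct
import zlib

# A hypothetical low-cardinality column: 100,000 int64 values cycling through 4 codes.
values = [i % 4 for i in range(100_000)]
raw = struct.pack(f"<{len(values)}q", *values)  # fixed-width, uncompressed

compressed = zlib.compress(raw, level=6)

print(len(raw))                          # size of the uncompressed buffer in bytes
print(len(compressed) < len(raw) // 10)  # compressed copy is well under a tenth of that
assert zlib.decompress(compressed) == raw  # lossless round trip
```

If one file is written compressed and the other is not, a large size gap like the one reported above is the expected outcome rather than a sign that one format is inherently wasteful.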

slt2021 960 days ago

I was just trying pandas native .to_parquet and .to_arrow() without any extra config knobs

lomereiter 960 days ago

The Arrow format is not intended for storage; it's for in-memory data exchange between different libraries and languages.

the_optimist 960 days ago

Seems likely a memory management issue in the Arrow interface for the language you’re using.
