Hacker News new | ask | show | jobs
by chrisjc 1702 days ago
I get what you're saying. This is their local workflow. Makes a little more sense looking at it this way. But...

In my opinion, the point about formats remains and the purpose of Arrow is lost.

I don't know enough about Arrow, but surely there is a better storage format than Parquet, all though storage isn't primary consideration for Arrow. The purpose of Arrow is not to have to convert from one format to another. Data can be efficiently transferred from RAM across the wire to RAM again without any significant transformation.

Surely there is a storage representation for Arrow to deliver similar characteristics as its intended use? Eg:

    disk (arrow format?) -> RAM (arrow)
instead of

    disk (parquet) -> RAM (parquet) -> CPU (transform) -> RAM (arrow)
There is a penalty (time and memory) dealing with parquet (or any intermediary format) and then transforming it to Arrow. What's the point of using Arrow if this is what you're going to do? Just use a parquet library instead of arrow (it's unclear to me what is actually performing the query in the Arrow step? Is it the R dataframe query, or did it push the query down to the Arrow data engine?).

After all, isn't this exactly how SQLite is being tested in this case? The original data file is loaded into SQLite and stored in SQLite's native file format providing all of the des-ser advantages SQLite provides out of the box. Not to mention the indexing that's defined as part of this preparation.

4 comments

I think the best way of looking at this is arrow-format = feather-format

I think the format layouts are the same, the main difference between them being compression. Compressed data can in many cases be faster than uncompressed data to read/scan, (https://stackoverflow.com/questions/48083405/what-are-the-di...) so the cpu transform above is simply an uncompress step, which is notably simpler than what would go on with Sqlite to transform the data to a sqlite structure.

What we're looking at here in this test is a direct-data-access pattern (DDA), which is important as you can avoid ETL caching steps eg. Parquet->Postgres (which have an ingestion time) if you can access the data quick enough for your use case, and if the data is on say s3, you can have multiple (parallel) readers onto the same data rather than a connection pool for databases.

It also allows joining different larger-than-memory datasets efficiently, and avoids much of the infrastructure costs for something like Presto (Athena has per-query costs instead of infrastructure costs).

What I'd have liked to see, would be a Vaex/DuckDB benchmark, as the main differentiation between them appears to be SQL vs df/linq/dyplr/Rx semantics.

The Parquet libs I’ve used let you read the Arrow representation straight out of a Parquet file.

There’s also an Arrow “blob” format that can be used, but that’s not meant to be used as a full persistence format (as I understand)

The value proposition is that increasing numbers of query/analysis “engines” support the Arrow in memory format which is fantastic for interoperability. “I need a data frame library like Pandas but I’m not in Python” is no longer an issue.

I think it works best when you have multiple languages in one process. Eg a Java app passing data to an embedded python script, where both languages manipulate the same data structures in memory.
I don't believe there is any 'just parquet' library for R. The `arrow` package is perhaps the easiest way to handle parquet files in R.