Hacker News new | ask | show | jobs
by chrisjc 1692 days ago
I'm a little confused with Arrow being included in the comparison.

What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?

But isn't Arrow a format too? As I understand it, Arrow is a format optimized for transferring in-memory data from one distributed system to another (ser-des), while also facilitating and optimizing certain set operations. From RAM in one system to RAM in another.

Moreover, since Arrow is a format, why is it being compared to databases like SQLite and DuckDB? If we're talking about formats, why not compare Arrow queries against Parquet data to DuckDB queries against Parquet data? https://duckdb.org/docs/data/parquet

Why not at least benchmark the query execution alone instead of startup and loading of data? For Arrow, isn't it assumed that there is an engine like Spark or Snowflake already up and running that's serving you data in the Arrow format? Ideally, with Arrow you should never be dealing with data starting in a resting format like Parquet. The data should already be in RAM to reap the benefits of Arrow. Its value proposition is it'll get "live" data from point A to B as efficiently as possible, in an open, non-proprietary, ubiquitous (eventually) format.

Exactly what of SQLite, DuckDB and Arrow is being compared here?

I would assume the benefits of Arrow in R (or DataFrames in general) would be getting data from a data engine into your DataFrame runtime as efficiently as possible. (just as interesting might be where and how push-downs are handled)

Perhaps I'm missing the trees for the forest?

No disrespect to the author... Seems like they're on a quest for knowledge, and while the article is confusing to me, it certainly got me thinking.

Disclaimer: I don't read R too good, and I'm still struggling with what exactly Arrow is. (Comparisons like this actually leave me even more confused about what Arrow is)

4 comments

> What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?

This question comes up quite often. Parquet is a _file_ format, Arrow is a language-independent _in-memory_ format. You can e.g. read a parquet file into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Another way of looking at it, if you have a C++ background, is that (roughly speaking) it makes C++'s coolest feature - templates -, and the performance gains obtained by the concomitant inlinability of the generated code - available in other languages. For example, you can write `pa.array([1, 2], type=pa.uint16())` in python, which translates roughly to `std::vector<uint16_t>{1, 2}` in C++. But it's not quite that; Arrow arrays actually consist of several buffers, one of which is a bit mask indicating whether the next item in the array is valid or missing (what previously was accomplished by NaN).

While I'm not a huge fan of Arrow's inheritance-based C++ implementation (it's quite clunky to say the least), it's an important project IMHO.

Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.

Makes sense. I should have included this functionality in my description of the value Arrow brings:

> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Very powerful indeed.

You lost me here though:

> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!

What is already being used for what?

The example that follows that describes the advantages of PySpark (Python/Scala) using Arrow makes sense, but I'm having trouble understanding your assertion relating it to SQLite and DuckDB?

> What is already being used for what?

Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a parquet file (and load it into an Arrow buffer).

And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB. But you don't need to hassle with setting up a DB server and maintaining it. All you need is a job that outputs a parquet file, and uploads that to S3. And then Spark - through Arrow! - will allow you to execute queries against that DB.

Using Arrow + Spark, you get the ability to a dataframe as if it's SQL, but you can still do pandas-style stuff i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, foreign keys.

>Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.

That's like comparing SQLite to Scala because Spark is written in Scala and exposes a SQL interface.

This isn’t a comp sci geek doing performance analysis. It’s a guy who wants to do stats, and then throws up 3 options and discusses tradeoffs - memory and time consumption. He got his little workflow from a few minutes to a second, and that is what matters to him.
I was also confused at the headline but felt the comparison made sense by the end of the article.

> Why not at least benchmark the query execution alone instead of startup and loading of data?

Because they don’t intend to have R running with 4gb of data in memory constantly.

I’d assume there’s a SQLite /duckdb instance running, but that it is negligible. Maybe the test even includes starting the database server.

So the perspective is “I want some data on my hard drive, and I want to query it from R every once in a while, what are my options?”

And suddenly a file format becomes comparable to a database.

Another way to look at it is to consider the compiled arrow code in the arrow library as a database engine. Then arrow is just another alternative to SQLite.

I get what you're saying. This is their local workflow. Makes a little more sense looking at it this way. But...

In my opinion, the point about formats remains and the purpose of Arrow is lost.

I don't know enough about Arrow, but surely there is a better storage format than Parquet, all though storage isn't primary consideration for Arrow. The purpose of Arrow is not to have to convert from one format to another. Data can be efficiently transferred from RAM across the wire to RAM again without any significant transformation.

Surely there is a storage representation for Arrow to deliver similar characteristics as its intended use? Eg:

    disk (arrow format?) -> RAM (arrow)
instead of

    disk (parquet) -> RAM (parquet) -> CPU (transform) -> RAM (arrow)
There is a penalty (time and memory) dealing with parquet (or any intermediary format) and then transforming it to Arrow. What's the point of using Arrow if this is what you're going to do? Just use a parquet library instead of arrow (it's unclear to me what is actually performing the query in the Arrow step? Is it the R dataframe query, or did it push the query down to the Arrow data engine?).

After all, isn't this exactly how SQLite is being tested in this case? The original data file is loaded into SQLite and stored in SQLite's native file format providing all of the des-ser advantages SQLite provides out of the box. Not to mention the indexing that's defined as part of this preparation.

I think the best way of looking at this is arrow-format = feather-format

I think the format layouts are the same, the main difference between them being compression. Compressed data can in many cases be faster than uncompressed data to read/scan, (https://stackoverflow.com/questions/48083405/what-are-the-di...) so the cpu transform above is simply an uncompress step, which is notably simpler than what would go on with Sqlite to transform the data to a sqlite structure.

What we're looking at here in this test is a direct-data-access pattern (DDA), which is important as you can avoid ETL caching steps eg. Parquet->Postgres (which have an ingestion time) if you can access the data quick enough for your use case, and if the data is on say s3, you can have multiple (parallel) readers onto the same data rather than a connection pool for databases.

It also allows joining different larger-than-memory datasets efficiently, and avoids much of the infrastructure costs for something like Presto (Athena has per-query costs instead of infrastructure costs).

What I'd have liked to see, would be a Vaex/DuckDB benchmark, as the main differentiation between them appears to be SQL vs df/linq/dyplr/Rx semantics.

The Parquet libs I’ve used let you read the Arrow representation straight out of a Parquet file.

There’s also an Arrow “blob” format that can be used, but that’s not meant to be used as a full persistence format (as I understand)

The value proposition is that increasing numbers of query/analysis “engines” support the Arrow in memory format which is fantastic for interoperability. “I need a data frame library like Pandas but I’m not in Python” is no longer an issue.

I think it works best when you have multiple languages in one process. Eg a Java app passing data to an embedded python script, where both languages manipulate the same data structures in memory.
I don't believe there is any 'just parquet' library for R. The `arrow` package is perhaps the easiest way to handle parquet files in R.
(SQLite and DuckDB are both client-side databases.)
You mean in-process databases :)
Is database daemon the right word, when the database is running locally?

I thought the database process was still a server when running locally, like how Wayland is a display server.

There isn't a daemon to speak off, they are shared libraries that you load and call functions on.

I don't know much about R, but if it considers the whole memory space of the R process, it probably counted the memory consumption of the libraries themselves as well.

The memory profiling only measures allocations on the R heap (i.e. objects managed by R's GC). It doesn't measure allocations by other libraries running in the the same process, unless they use R's allocation machinery.
I assume Parquet has a close relationship with Arrow (project wise) and the parquet libraries I’ve used all supported reading the Arrow representation straight from the Parquet file.

I imagine they’re all getting compared because 2 of them are “embedded” relational DB (even though their data structure layout is optimised for different things) and the remaining option appears to present similar benefits and functionality in a different style.