| I'm a little confused with Arrow being included in the comparison. What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format? But isn't Arrow a format too? As I understand it, Arrow is a format optimized for transferring in-memory data from one distributed system to another (ser-des), while also facilitating and optimizing certain set operations. From RAM in one system to RAM in another. Moreover, since Arrow is a format, why is it being compared to databases like SQLite and DuckDB? If we're talking about formats, why not compare Arrow queries against Parquet data to DuckDB queries against Parquet data? https://duckdb.org/docs/data/parquet Why not at least benchmark the query execution alone instead of startup and loading of data? For Arrow, isn't it assumed that there is an engine like Spark or Snowflake already up and running that's serving you data in the Arrow format? Ideally, with Arrow you should never be dealing with data starting in a resting format like Parquet. The data should already be in RAM to reap the benefits of Arrow. Its value proposition is it'll get "live" data from point A to B as efficiently as possible, in an open, non-proprietary, ubiquitous (eventually) format. Exactly what of SQLite, DuckDB and Arrow is being compared here? I would assume the benefits of Arrow in R (or DataFrames in general) would be getting data from a data engine into your DataFrame runtime as efficiently as possible. (just as interesting might be where and how push-downs are handled) Perhaps I'm missing the trees for the forest? No disrespect to the author... Seems like they're on a quest for knowledge, and while the article is confusing to me, it certainly got me thinking. Disclaimer: I don't read R too good, and I'm still struggling with what exactly Arrow is. (Comparisons like this actually leave me even more confused about what Arrow is) |
This question comes up quite often. Parquet is a _file_ format; Arrow is a language-independent _in-memory_ format. You can, e.g., read a Parquet file into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).
Another way of looking at it, if you have a C++ background, is that (roughly speaking) it makes C++'s coolest feature, templates, and the performance gains that come from the inlinable code they generate, available in other languages. For example, you can write `pa.array([1, 2], type=pa.uint16())` in Python, which translates roughly to `std::vector<uint16_t>{1, 2}` in C++. But it's not quite that; Arrow arrays actually consist of several buffers, one of which is a bit mask indicating, for each item in the array, whether it is valid or missing (what previously was accomplished by NaN).
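The "values buffer plus bit mask" idea can be sketched in plain Python. This is a toy model, not Arrow's actual implementation; the helper names `pack_validity`, `is_valid`, and `to_list` are invented here. It does follow the Arrow convention that a set bit means "valid" and that bits are numbered from the least significant bit of each byte:

```python
from array import array


def pack_validity(values):
    """Split a list of ints/None into a values buffer and a validity bitmap."""
    # Null slots still occupy space in the values buffer; their content is arbitrary.
    data = array("H", (0 if v is None else v for v in values))  # unsigned 16-bit slots
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per element, rounded up
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first numbering, 1 = valid
    return data, bitmap


def is_valid(bitmap, i):
    """Check bit i of the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def to_list(data, bitmap):
    """Rebuild the logical values, turning invalid slots back into None."""
    return [data[i] if is_valid(bitmap, i) else None for i in range(len(data))]


data, bitmap = pack_validity([1, None, 2])
print(list(data), bin(bitmap[0]))
print(to_list(data, bitmap))
```

Unlike the NaN trick, this works for any type (integers, strings, booleans), since missingness is stored separately from the values themselves.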
While I'm not a huge fan of Arrow's inheritance-based C++ implementation (it's quite clunky to say the least), it's an important project IMHO.
Next, why compare Arrow with SQLite and DuckDB? Because that's what it's being used for already! For example, PySpark uses Arrow to move data between the Python process and the JVM (Spark itself is implemented in Scala), providing access to the data through an SQL-like language.