|
> What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format? This question comes up quite often. Parquet is a _file_ format, Arrow is a language-independent _in-memory_ format. You can e.g. read a parquet file into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies). Another way of looking at it, if you have a C++ background, is that (roughly speaking) it makes C++'s coolest feature - templates -, and the performance gains obtained by the concomitant inlinability of the generated code - available in other languages. For example, you can write `pa.array([1, 2], type=pa.uint16())` in python, which translates roughly to `std::vector<uint16_t>{1, 2}` in C++. But it's not quite that; Arrow arrays actually consist of several buffers, one of which is a bit mask indicating whether the next item in the array is valid or missing (what previously was accomplished by NaN). While I'm not a huge fan of Arrow's inheritance-based C++ implementation (it's quite clunky to say the least), it's an important project IMHO. Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language. |
> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).
Very powerful indeed.
You lost me here though:
> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!
What is already being used for what?
The example that follows that describes the advantages of PySpark (Python/Scala) using Arrow makes sense, but I'm having trouble understanding your assertion relating it to SQLite and DuckDB?