| HN Mirror


by seertaak 1702 days ago

> What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?

This question comes up quite often. Parquet is a _file_ format; Arrow is a language-independent _in-memory_ format. You can, for example, read a Parquet file into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Another way of looking at it, if you have a C++ background, is that (roughly speaking) it makes C++'s coolest feature, templates, and the performance gains obtained from the inlinability of the generated code, available in other languages. For example, you can write `pa.array([1, 2], type=pa.uint16())` in Python, which translates roughly to `std::vector<uint16_t>{1, 2}` in C++. But it's not quite that; Arrow arrays actually consist of several buffers, one of which is a bit mask indicating whether each item in the array is valid or missing (what previously was accomplished with NaN).

While I'm not a huge fan of Arrow's inheritance-based C++ implementation (it's quite clunky to say the least), it's an important project IMHO.

Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.


chrisjc 1702 days ago

Makes sense. I should have included this functionality in my description of the value Arrow brings:

> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Very powerful indeed.

You lost me here though:

> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!

What is already being used for what?

The example that follows, describing the advantages of PySpark (Python/Scala) using Arrow, makes sense, but I'm having trouble understanding how your assertion relates it to SQLite and DuckDB.


seertaak 1701 days ago

> What is already being used for what?

Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a Parquet file (and load it into an Arrow buffer).

And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB. But you don't need to hassle with setting up a DB server and maintaining it. All you need is a job that outputs a Parquet file and uploads it to S3. And then Spark - through Arrow! - will allow you to execute queries against that data as if it were a DB.

Using Arrow + Spark, you get the ability to query a dataframe as if it's a SQL table, but you can still do pandas-style stuff, i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, and foreign keys.


marcinzm 1702 days ago

>Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.

That's like comparing SQLite to Scala because Spark is written in Scala and exposes a SQL interface.
