by chrisjc 1699 days ago
Makes sense. I should have included this functionality in my description of the value Arrow brings:

> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Very powerful indeed.

You lost me here though:

> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!

What is already being used for what?

The example that follows, describing the advantages of PySpark (Python/Scala) using Arrow, makes sense, but I'm having trouble understanding your assertion relating it to SQLite and DuckDB.

1 comments

> What is already being used for what?

Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a parquet file (and load it into an Arrow buffer).

And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB, but without the hassle of setting up and maintaining a DB server. All you need is a job that outputs a Parquet file and uploads it to S3. Then Spark - through Arrow! - will let you execute queries against that data.

Using Arrow + Spark, you get the ability to query a dataframe as if it's SQL, but you can still do pandas-style stuff, i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, and foreign keys.
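For the relational-DB half of that trade-off, here is a small sketch using Python's built-in `sqlite3` module. The `users`/`orders` schema and the `try_insert` helper are hypothetical; the point is only that the database itself rejects bad rows, which a plain Parquet file has no way to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless you enable it per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        amount REAL CHECK (amount > 0)
    )
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO orders (user_id, amount) VALUES (1, 9.5)")
conn.commit()


def try_insert(user_id, amount):
    """Attempt an insert; report whether the DB accepted or rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                (user_id, amount),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


orphan = try_insert(42, 5.0)    # no such user: foreign key violation
negative = try_insert(1, -1.0)  # violates the CHECK constraint
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(orphan)
print(negative)
print(order_count)
```

With the Parquet route, checks like these have to live in whatever job writes the file.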