by seertaak 1738 days ago

> What is already being used for what?

Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a Parquet file (and load it into an Arrow buffer).

And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB. But you don't need to hassle with setting up a DB server and maintaining it. All you need is a job that outputs a Parquet file and uploads it to S3. And then Spark - through Arrow! - will allow you to execute queries against that file.

Using Arrow + Spark, you get the ability to query a dataframe as if it were a SQL table, but you can still do pandas-style stuff, i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, foreign keys.