| > in-process think of sqlite.
> OLAP think data warehouse. Columnar for analytical workloads.
If you want something "in-process", you're probably going to have to decide between SQLite and DuckDB. If your workload is
1) individual, fast, and frequent read-write operations (OLTP), then you should probably pick SQLite.
2) massive amounts of read-heavy analytical operations (OLAP), then you should probably pick DuckDB.
That's the decision process stated as simply as possible, but obviously there might be other options out there to consider.
Postgres and MySQL are really better suited for out-of-process (shared server) workloads where multiple clients are interacting with the data and resources/compute. They are both row-based OLTP databases, although I do believe Postgres can support both table types (HTAP), at least through extensions that add columnar storage.
Parquet is a file format, just as Avro, JSON, CSV, ... are. Arrow (still grasping this one) is an in-memory columnar format that lets data be exchanged between systems and processes without the extra steps of being converted or copied around in memory. For example, the data returned from a SQL query can be used directly in a Python/Scala/etc. dataframe if using Arrow.
I empathize with you about how confusing it all seems, but your curiosity will serve you well. I remember when I was asking these very same questions, and being led down this road is what opened my mind to the world of databases and data engineering. Google "olap vs oltp", and when you get that, google "olap vs oltp vs htap". Or maybe read "The Log: What every software engineer should know about real-time data's unifying abstraction". I know how lame it sounds, but this article really did change the way I think about data. https://engineering.linkedin.com/distributed-systems/log-wha... |
Now THAT is easy to understand. Thank you.
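For anyone who wants to see the row-based vs. columnar distinction from the comment above in concrete terms, here is a rough sketch in plain Python. It is only a toy model, not how SQLite or DuckDB actually store data, and every name in it (`orders_rows`, `orders_columns`, the helper functions) is made up for illustration. The point is just that a row layout keeps each record together, which suits single-record lookups and writes (OLTP), while a column layout keeps each column together, which suits scans over one column of many records (OLAP).

```python
# Toy illustration only: real engines (SQLite, DuckDB) are far more sophisticated.
# All names below are hypothetical and exist only for this example.

# Row-oriented layout (OLTP style): each record is stored together.
orders_rows = [
    {"id": 1, "customer": "ann", "amount": 30.0},
    {"id": 2, "customer": "bob", "amount": 12.5},
    {"id": 3, "customer": "ann", "amount": 7.5},
]

# Column-oriented layout (OLAP style): each column is stored together.
orders_columns = {
    "id": [1, 2, 3],
    "customer": ["ann", "bob", "ann"],
    "amount": [30.0, 12.5, 7.5],
}


def get_order_row_store(order_id):
    """Point lookup: the matching record is already stored as one unit."""
    for row in orders_rows:
        if row["id"] == order_id:
            return row
    return None


def get_order_column_store(order_id):
    """Point lookup: the record must be reassembled from every column."""
    i = orders_columns["id"].index(order_id)
    return {name: values[i] for name, values in orders_columns.items()}


def total_amount_row_store():
    """Aggregate: has to walk every full record just to read one field."""
    return sum(row["amount"] for row in orders_rows)


def total_amount_column_store():
    """Aggregate: reads a single contiguous column and nothing else."""
    return sum(orders_columns["amount"])
```

Both layouts return the same answers; what differs is how much unrelated data each operation has to touch, and that difference is what grows into the OLTP vs. OLAP trade-off at scale.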