| It's a tech talk, so I watched it and made some notes. It's an argument for using in-process databases when doing data science, rather than an external DB. The speaker pitches DuckDB as a concrete example, which seems to be an in-process DB for Python data frames. The speaker presents measurements showing how much overhead the wire protocols of various DBs have. MySQL's is the best; PostgreSQL's is orders of magnitude worse due to a very inefficient binary format design. The best is still 10x worse than netcat. Apache Arrow is trying to design a universal protocol for DB access that's more efficient than what's out there currently. The speaker asserts that scale-out is usually not needed in data analytics: no need to use Spark etc. unless you want it on your CV. An audience member asks "what about multi-user/multi-process access?", and the speaker admits DuckDB basically doesn't do that. The speaker pitches using embedded in-process DBs inside AWS Lambda functions: it's not practical to install Oracle RDBMS in something that only runs for 100 ms. A web shell for DuckDB is demonstrated; it uses WASM. Decentralization is pitched as a reason to avoid a 2-tier architecture (separate DB engine with a client protocol). |
It's not only impractical, it's hard to get working at all. I recently tried to run Postgres in an AWS Lambda to create an anonymized DB dump. It was so painful that I gave up and created an access-restricted database to do the anonymization instead. An in-memory mode for Postgres that is as easy to run as SQLite or DuckDB would be so useful for the cases where you cannot replace Postgres with either of them (SQL dumps, testing).
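For comparison, here is roughly what the "as easy to run as SQLite" workflow looks like, as a minimal sketch using only Python's standard-library `sqlite3` module. The `users` table, the `anon` SQL function, the `handler` entry point, and the shape of `event` are all made up for illustration, and the unsalted hash is only a placeholder for a real anonymization scheme. It does not help when you need a Postgres-format dump, which is exactly the gap described above.

```python
import hashlib
import sqlite3


def anonymized_dump(rows):
    """Load rows into an in-memory SQLite DB, anonymize them, return a SQL dump."""
    conn = sqlite3.connect(":memory:")  # the database lives only inside this process
    try:
        # Hypothetical schema, chosen just for this example.
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
        )
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

        # Register a Python function so the anonymization step runs inside SQL.
        # An unsalted hash is a placeholder, not a real anonymization scheme.
        conn.create_function(
            "anon",
            1,
            lambda v: hashlib.sha256(v.encode()).hexdigest()[:12] + "@example.invalid",
        )
        conn.execute("UPDATE users SET email = anon(email)")
        conn.commit()

        # iterdump() yields the SQL statements needed to recreate the database.
        return "\n".join(conn.iterdump())
    finally:
        conn.close()


def handler(event, context=None):
    # Hypothetical Lambda-style entry point; the event shape is an assumption.
    return {"dump": anonymized_dump(event["rows"])}
```

No server to install, no socket, no wire protocol: the whole thing starts and finishes inside one short-lived function call, which is the property that makes the embedded engines a good fit for Lambda.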