by Waterluvian 1201 days ago
Mind you, this isn’t appropriate for most cases. But I love the idea of “you start with a text file. You end with a text file. All the database stuff, indexes, etc. are just a detail.”

Often I find that the database wants to be the authority and that makes working with different formats a bit uncomfortable.
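
To make the "indexes are just a detail" idea concrete, here is a minimal sketch in Python. It assumes a tab-separated file with one record per line and the key in the first column; the function names `build_offset_index` and `lookup` are made up for illustration. The text stays the authority, and the index is a throwaway structure rebuilt from it on demand.

```python
import io


def build_offset_index(f, key_field=0, sep=b"\t"):
    """Scan a binary file object and map each record's key to its byte offset.

    Assumption: one record per line, fields separated by `sep`.
    """
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.rstrip(b"\n").split(sep)[key_field]
        index[key] = offset
    return index


def lookup(f, index, key, sep=b"\t"):
    """Jump straight to a record using the index, then parse that one line."""
    f.seek(index[key])
    return [field.decode() for field in f.readline().rstrip(b"\n").split(sep)]


# An in-memory stand-in for a text file on disk.
text_file = io.BytesIO(b"alice\t30\nbob\t25\n")
index = build_offset_index(text_file)
bob = lookup(text_file, index, b"bob")
```

If the index is lost or goes stale, nothing of value is gone: delete it and scan the file again.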


We’re currently building real-time APIs backed by terabytes of compressed Parquet… hundreds of billions of ‘rows’… in exactly this fashion using Polars. It amazes us at every turn.

Join us and help!

What project?

Do you mean Polars reading Parquet into DuckDB to process that amount of data?

Internal. We're using Polars as the query engine to effectively query that data statically at rest (more accurately, mmap'd on disk in Arrow IPC format).
What does this look like in practice? Using the filesystem as a database?
GNU Recutils https://www.gnu.org/software/recutils/

is a good example of an actual database that uses plaintext files in your filesystem.

I can see the argument that doing this with JSON is better (or worse), but regardless, Recutils is an interesting idea that I wish more people knew about. I can imagine a lot of cool things emerging if people would iterate on the idea.
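
For anyone who hasn't seen the format, here is a rough Python sketch of reading a rec-style file. It is not Recutils itself and only covers a simplified subset: `Name: value` fields, blank lines between records, `#` comment lines, and `+` continuation lines. The helpers `parse_rec` and `select` and the sample data are made up for illustration.

```python
def parse_rec(text):
    """Parse a simplified subset of the rec format into a list of records.

    Each record is a list of (name, value) pairs, because a field name
    may appear more than once in the same record.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):  # comment line
            continue
        if not line.strip():  # a blank line ends the current record
            if current:
                records.append(current)
                current = []
            continue
        if line.startswith("+") and current:  # continuation of the previous field
            name, value = current[-1]
            extra = line[1:]
            if extra.startswith(" "):
                extra = extra[1:]
            current[-1] = (name, value + "\n" + extra)
            continue
        name, _, value = line.partition(":")
        current.append((name, value.strip()))
    if current:
        records.append(current)
    return records


def select(records, name, value):
    """Return the records that contain the given field with the given value."""
    return [r for r in records if (name, value) in r]


SAMPLE = """\
# Books I own
Title: GNU Emacs Manual
Author: Richard M. Stallman
Location: home

Title: The Colour of Magic
Author: Terry Pratchett
Location: loaned
"""

records = parse_rec(SAMPLE)
loaned = select(records, "Location", "loaned")
```

The point is that the file stays readable and editable by hand, and the "query engine" is a thin layer on top of it.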

Recutils is great, but it needs a rewrite, I think.
Anything that stores data on a computer is essentially a database. It's all about representation and what kinds of operations you prioritize for performance.
Apache Spark / Databricks is an example of this. Parquet files are stored in folders. A folder is assumed to hold one dataset split into multiple files based on specified partition criteria. The VMs read the necessary files into memory and then operate on them.
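
A toy Python version of that folder-as-dataset layout, as a sketch only. It uses CSV files from the standard library as a stand-in for Parquet and assumes Hive-style `key=value` folder names for the partitions; `write_partition` and `read_dataset` are hypothetical helpers, not Spark APIs.

```python
import csv
import tempfile
from pathlib import Path


def write_partition(root, partition, rows, fieldnames, part_name="part-0000.csv"):
    """Write rows into a folder such as <root>/year=2024/part-0000.csv."""
    folder = Path(root).joinpath(*(f"{k}={v}" for k, v in partition.items()))
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / part_name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_dataset(root, **wanted):
    """Yield rows from every file whose partition folders match `wanted`.

    Files in non-matching partitions are skipped without being opened,
    which is the "read only the necessary files" part.
    """
    root = Path(root)
    for path in sorted(root.rglob("*.csv")):
        folders = path.relative_to(root).parts[:-1]
        partition = dict(part.split("=", 1) for part in folders)
        if any(partition.get(k) != str(v) for k, v in wanted.items()):
            continue
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {**partition, **row}


with tempfile.TemporaryDirectory() as root:
    fields = ["id", "amount"]
    write_partition(root, {"year": 2023}, [{"id": "1", "amount": "10"}], fields)
    write_partition(
        root,
        {"year": 2024},
        [{"id": "2", "amount": "20"}, {"id": "3", "amount": "5"}],
        fields,
    )
    rows_2024 = list(read_dataset(root, year=2024))
    all_rows = list(read_dataset(root))
```

The partition values live only in the folder names, so filtering on them never touches the file contents.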
Isn't Linux a good example of this? Everything is a file.