| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by electroly 45 days ago
	I have a system that builds SQLite databases and uploads them to S3. Once they're in S3, they are never changed. The program that builds the databases only does writes, and the program that queries the databases only does reads. It uses a VFS to query the database in-place with HTTP range requests. This is indeed not an optimal setup. A more careful design from first principles would not require seeking around the file as much as SQLite does, we'd do a better job on reading exactly the correct range of bytes for a given query since we know ahead of time what the access patterns are, and we could do reads in parallel. With SQLite we have to be very careful about the schema design to ensure it won't have to seek too many times to answer a query. But SQLite was expedient, and I'm confident I'll always be able to read the files. That's less certain for a custom file format.

2 comments

giovannibonetti 45 days ago

If it's going to be read-only, why not make it a Parquet file instead? It should result in a smaller file size due to columnar compression.

DuckDB has built-in capability to read Parquet files with HTTP range requests.

link

electroly 45 days ago

For this use case we need the ability to do an indexed query and extract a small number of rows from a large database. It's a traditional row-oriented database workload. I'm sure other solutions would also work, but SQLite's design melds well with the data. The migration from partitioned SQL Server tables to a collection of SQLite database files was straightforward.

link

1718627440 44 days ago

If it's that small, why not compile it directly into the application? The compiler might even be able to optimize a good chunk of data away, or optimize the code based on the data.

link

electroly 44 days ago

Nothing small about it. This is terabytes of data and new files are added daily with new data. That's why we had to move it into S3. It was costing too much to store in SQL Server.

I only reach for SQLite when it's big. Otherwise, I'd prefer to just use the existing SQL Server infrastructure, or .csv.gz files in S3. Internally, I use the term "bulk data" for data sets that are stored in S3 because they were too big for SQL Server.

link