| HN Mirror



	by giza182 39 days ago
	Perhaps a dumb question, but how do you get data into it if you’re not doing writes?

5 comments

electroly 39 days ago

I have a system that builds SQLite databases and uploads them to S3. Once they're in S3, they are never changed. The program that builds the databases only does writes, and the program that queries the databases only does reads. It uses a VFS to query the database in-place with HTTP range requests.
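
A minimal sketch of the writer/reader split described above, using only Python's standard `sqlite3` module on a local file. The S3 upload and the HTTP range-request VFS are left out; the table name `readings`, the file name `bulk.db`, and the helper names are hypothetical and chosen only for illustration. The reader opens the file with the SQLite URI parameters `mode=ro` and `immutable=1`, which is one way to tell SQLite that the file will never change.

```python
import os
import sqlite3
import tempfile


def build_database(path):
    """Writer side: create the file once, then never touch it again."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value TEXT)")
    con.executemany(
        "INSERT INTO readings (value) VALUES (?)",
        [("a",), ("b",), ("c",)],
    )
    con.commit()
    con.close()


def open_read_only(path):
    """Reader side: mode=ro refuses writes; immutable=1 tells SQLite the
    file cannot change, so it skips locking and change detection."""
    return sqlite3.connect(f"file:{path}?mode=ro&immutable=1", uri=True)


def demo():
    # Hypothetical local path standing in for an object uploaded to S3.
    path = os.path.join(tempfile.mkdtemp(), "bulk.db")
    build_database(path)

    con = open_read_only(path)
    count = con.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    try:
        con.execute("INSERT INTO readings (value) VALUES ('d')")
        write_rejected = False
    except sqlite3.OperationalError:
        # SQLite reports "attempt to write a readonly database".
        write_rejected = True
    con.close()
    return count, write_rejected
```

Calling `demo()` builds the file, reads the row count back through the read-only connection, and confirms that an attempted insert is rejected.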

This is indeed not an optimal setup. A more careful design from first principles would not need to seek around the file as much as SQLite does. Since we know the access patterns ahead of time, we could read exactly the right range of bytes for a given query, and we could issue reads in parallel. With SQLite we have to be very careful about the schema design to ensure it won't have to seek too many times to answer a query. But SQLite was expedient, and I'm confident I'll always be able to read the files. That's less certain for a custom file format.
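
To make the "seeking" concrete: SQLite reads a database in fixed-size pages, so each page touched by a query maps to one byte range. The sketch below is an illustration, not the poster's VFS. It reads the 100-byte file header, takes the page size from the two big-endian bytes at offset 16 (a stored value of 1 means 65536), and computes the `Range` header value for a given page. The local-file `read_range` helper is a hypothetical stand-in for an HTTP range request.

```python
import os
import sqlite3
import struct
import tempfile

HEADER_SIZE = 100  # the SQLite file header occupies the first 100 bytes


def read_range(path, start, length):
    """Stand-in for an HTTP GET with 'Range: bytes=start-(start+length-1)'."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


def page_size_from_header(header):
    """Parse the page size from the database header."""
    if header[:16] != b"SQLite format 3\x00":
        raise ValueError("not a SQLite database")
    (size,) = struct.unpack(">H", header[16:18])
    return 65536 if size == 1 else size


def page_range(page_number, page_size):
    """Pages are numbered from 1; return the inclusive byte range of one page."""
    start = (page_number - 1) * page_size
    return start, start + page_size - 1


def demo():
    # Build a tiny database with an explicit page size so the result is fixed.
    path = os.path.join(tempfile.mkdtemp(), "sample.db")
    con = sqlite3.connect(path)
    con.execute("PRAGMA page_size = 4096")
    con.execute("CREATE TABLE t (x INTEGER)")
    con.commit()
    con.close()

    header = read_range(path, 0, HEADER_SIZE)
    size = page_size_from_header(header)
    first, last = page_range(2, size)
    return size, f"bytes={first}-{last}"
```

Every B-tree level a lookup descends costs at least one such page read, which is why the schema has to keep the number of pages per query small when each read is a network round trip.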


giovannibonetti 39 days ago

If it's going to be read-only, why not make it a Parquet file instead? It should result in a smaller file size due to columnar compression.

DuckDB has built-in capability to read Parquet files with HTTP range requests.


electroly 39 days ago

For this use case we need the ability to do an indexed query and extract a small number of rows from a large database. It's a traditional row-oriented database workload. I'm sure other solutions would also work, but SQLite's design melds well with the data. The migration from partitioned SQL Server tables to a collection of SQLite database files was straightforward.
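
As a rough illustration of that kind of workload, here is a sketch that asks SQLite how it would answer a point lookup. The `quotes` table, its columns, and the index name `idx_quotes_symbol_day` are hypothetical and not the actual schema. `EXPLAIN QUERY PLAN` should report a search through the index rather than a full table scan.

```python
import sqlite3


def plan_for_lookup():
    """Return the query-plan detail strings for an indexed point lookup."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE quotes (symbol TEXT, day TEXT, price REAL)")
    con.execute("CREATE INDEX idx_quotes_symbol_day ON quotes (symbol, day)")
    con.executemany(
        "INSERT INTO quotes VALUES (?, ?, ?)",
        [
            ("ABC", "2020-01-02", 10.0),
            ("ABC", "2020-01-03", 10.5),
            ("XYZ", "2020-01-02", 99.0),
        ],
    )
    rows = con.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT price FROM quotes WHERE symbol = ? AND day = ?",
        ("ABC", "2020-01-02"),
    ).fetchall()
    con.close()
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in rows]
```

Printing `plan_for_lookup()` shows whether the lookup goes through the index. With a remote file, an indexed search touches only a handful of pages, while a scan would fetch the whole table.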


1718627440 38 days ago

If it's that small, why not compile it directly into the application? The compiler might even be able to optimize a good chunk of data away, or optimize the code based on the data.


electroly 38 days ago

Nothing small about it. This is terabytes of data and new files are added daily with new data. That's why we had to move it into S3. It was costing too much to store in SQL Server.

I only reach for SQLite when it's big. Otherwise, I'd prefer to just use the existing SQL Server infrastructure, or .csv.gz files in S3. Internally, I use the term "bulk data" for data sets that are stored in S3 because they were too big for SQL Server.


andrelaszlo 39 days ago

I think it's just immutable once you've generated it. No need to update indexes or check consistency on writes, no need for transactions, etc.


tnelsond4 39 days ago

Generate it one time from a source TSV file or a folder of media.
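
A minimal sketch of that one-time generation step for the TSV case, assuming the first line of the file holds the column names. The table name `items` and the sample data are hypothetical. Columns are created without declared types, so values are stored as the text read from the file.

```python
import csv
import io
import sqlite3


def load_tsv(con, tsv_file, table="items"):
    """Create a table from a TSV header row and bulk-insert the remaining rows."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    con.execute(f'CREATE TABLE "{table}" ({columns})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()


def demo():
    # An in-memory stand-in for the source file on disk.
    source = io.StringIO("name\tyear\nalpha\t1999\nbeta\t2000\n")
    con = sqlite3.connect(":memory:")
    load_tsv(con, source)
    rows = con.execute("SELECT name, year FROM items ORDER BY name").fetchall()
    con.close()
    return rows
```

In a real build the connection would point at a file on disk, and any indexes would be created after the bulk insert, before the file is published.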


pfortuny 39 days ago

Think historical records of, say, share values for past years. You might have a single db for 1900-2000, for instance. Things like that.

Not everything needs to be real-time updated.


shermantanktop 39 days ago

It’s an RODB. Ship the preindexed data blob.
