Hacker News
by elyobo 2336 days ago
We do something like this; one of the outputs of the data pipeline is a SQLite file that's deployed nightly along with the code to App Engine. The SQLite data is all read-only; read/write data for the app is stored in Firestore instead.

We initially used JSON but ran into memory issues; SQLite is more memory efficient, and being able to use SQL instead of the wild SQL-esque is both faster and more reliable.
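
A minimal sketch of that pattern, using only Python's standard `sqlite3` module. The table layout, file name, and helper names (`build_artifact`, `open_read_only`) are made up for illustration and are not the setup described above; the only real mechanism shown is opening the shipped file with the `mode=ro` URI parameter so the app cannot write to it.

```python
import os
import sqlite3
import tempfile


def build_artifact(path, rows):
    """Pipeline side: write the lookup data into a fresh SQLite file."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE products ("
            "sku TEXT PRIMARY KEY, name TEXT, price_cents INTEGER)"
        )
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    conn.close()


def open_read_only(path):
    """App side: open the shipped file read-only via a URI filename."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


# Hypothetical artifact; in a real deploy this file would ship with the code.
artifact_path = os.path.join(tempfile.mkdtemp(), "catalog.sqlite")
build_artifact(artifact_path, [("a1", "widget", 250), ("b2", "gadget", 900)])

db = open_read_only(artifact_path)
name = db.execute(
    "SELECT name FROM products WHERE sku = ?", ("b2",)
).fetchone()[0]

# Writes against the read-only connection are rejected by SQLite.
try:
    db.execute("DELETE FROM products")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```

Opening with `mode=ro` makes the read-only contract explicit in code rather than relying on file permissions alone.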

2 comments

Yes, I have been doing the same thing, only with LMDB.

I do not think LMDB could load from an in-memory-only object, however (as it has to have a file to memory-map).

But the design reasons were the same. I wanted something that:

a) I can move across host architectures

b) can act as a key-value cache as soon as the processes using it are restarted (so no cache-hydration delay)

c) something that I can diff/archive/restore/modify in place

We tested SQLite for the above purposes at the time; for write speed and (b), LMDB was significantly faster.

So we lost the flexibility of SQLite, but I felt it was a reasonable tradeoff, given our needs.
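
On the in-memory point, for contrast: SQLite can copy an on-disk database into a purely in-memory one. A rough sketch using the standard library's `Connection.backup` (Python 3.7+); the `kv` table and file name are hypothetical, and this says nothing about how LMDB behaves.

```python
import os
import sqlite3
import tempfile

# Build a small on-disk database to stand in for a deployed snapshot.
path = os.path.join(tempfile.mkdtemp(), "snapshot.sqlite")
disk = sqlite3.connect(path)
with disk:
    disk.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    disk.executemany(
        "INSERT INTO kv VALUES (?, ?)", [("alpha", "1"), ("beta", "2")]
    )

# Copy the whole database into memory, then drop the file entirely.
mem = sqlite3.connect(":memory:")
disk.backup(mem)
disk.close()
os.remove(path)

# Lookups are now served from the in-memory copy only.
value = mem.execute("SELECT v FROM kv WHERE k = ?", ("beta",)).fetchone()[0]
```

The trade-off is that the in-memory copy has to be rebuilt on every process start, which is exactly the hydration delay that point (b) is trying to avoid.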

I also know that one of Intel's Python toolkits for image recognition/AI (optionally) uses LMDB to store images, so that processing routines do not have to incur the cost of directory lookups when touching millions of small images. (I forgot the name of the toolkit, though)…

Overall, this is a very valid practice/pattern in data processing pipelines; kudos to you for mentioning it.

"wild SQL-esque" should have been "wild SQL-esque thing I wrote to query the JSON"