Hacker News new | ask | show | jobs
by pimbrouwers 486 days ago
Be brave. Just try it.

Removing the network latency is a massive gain, especially for small/medium apps, which is most sites. And if you find yourself asking this, the answer is yes, it will work.

SQLite will force you to rethink your architecture a bit, don't forget it doesn't HAVE to be _one_ file. This alone helps you scale further.

There are some chafing points: 1. Schema migrations 2. Configuration, PRAGMA's applied at design time or during each connection

8 comments

> Removing the network latency is a massive gain, especially for small/medium apps, which is most sites.

I've seen some apps that I can only describe as singleton apps, not even monoliths with their architecture - where you can only ever have one vertically scaled instance due to their reliance on the memory for something like storing user sessions (yes, everyone has to log in or use SSO after a restart), or the local file system for something like ingesting documents or generating reports.

At that point, one might as well ditch Oracle DB (used as example here, which is suprisingly common in those setups) and just work with SQLite directly, because if people made that their architecture then N+1 issues are also pretty much inevitable. It's not like SQLite will magically fix your code, but at least it will get rid of the network round trip when you try to execute 400 SQL SELECTs to render one view.

I will admit, that generally I enjoy architectures where the components are properly separated: front end served from a separate container, the back end API in another one, something like Redis/Valkey or RabbitMQ in that stack as well, in addition to a traditional RDBMS like PostgreSQL or even MariaDB/MySQL. On the other hand, I have made smaller projects for my own needs that use Ruby on Rails or Laravel or Django and SQLite goes really well with those.

Removing the network latency is a massive gain

People are comparing sqlite to RDS or whatever which completely biases the comparison. We have small apps running in containers talking to Postgres in a container via kubernetes networking on the same VM.

Benchmark.bm{ |x| x.report{ 1000.times{ ActiveRecord::Base.connection.execute("select 1") }}}

       user     system      total        real
   0.040279   0.012958   0.053237 (  0.080539)
0.08 seconds for 1000 DB queries is not causing network latency problems for any real world app. Just use Postgres unless you really have good reasons not to.
"Just use Postgres" is as useless as any other "Just use X".

You're trying to compare using SQLite which is stored in a file on the system... with using Postgres in a container via kubernetes on the same VM? What happens when you need to move that Postgres container somewhere else? How do you scale it? If you don't, then you're just using a solution orders of magnitude more complicated than SQLite to get the same functionality and locality as SQLite.

The important nuance here is that with SQLite you can achieve similar numbers even if each of the 1000 queries is dependent on prior results.

SELECT 1 is hardly representative of a real world situation wherein you need to logically serialize each transition of business state.

0.08 seconds for a few bytes, which are fully cached, yet it barely can reach 12500 requests per second.

Your computer is a multicore 2-4GHz machine

It kind of does have to be one file, because foreign keys can't be enforced across files.

For a simple example, if you have one db file per user, as others recommend, how do you check that a username or an email is unique? Do you maintain a "global" db for fields with unique constraints? But then when those fields are updated you need to manage this at the application level, and atomicity can't really be guaranteed... Or is there another way?

The application I'm building (I haven't launched yet) has one database per customer that tracks the things the customer does in the application, plus a single database that tracks the metadata about all of the customers. The metadata changes very slowly (each customer will probably make at most one update per year), so there will be (almost) no write contention on the metadata database. Each customer has an id (a random 64-bit integer) which is used (in hexadecimal) as the filename for the individual customer database.

The total addressable market for my application is about 2,500 customers, and they won't use it at all in most weeks. I'm aware that this usage pattern is very uncommon, so I don't recommend what I'm doing as a general solution.

> The total addressable market for my application is about 2,500 customers, and they won't use it at all in most weeks. I'm aware that this usage pattern is very uncommon

Is it even?

I'm with you on this. I think this is a widely desirable pattern. If it's uncommon it's because of how poorly traditional dbms's address it.

I suspect as server-side sqlite matures, this pattern will explode in popularity.

I've done pretty much that, but without SQLite as the origin of that system predated SQLites first release, with millions of users. It works, and it scales. Since then I've increasingly preferred this pattern when it's been suitable, and as SQLite has grown in maturity, it increasingly is suitable.
The pattern I've used most often is to maintain a db/table of meta records which handles identity and transaction. Then, for each metadata row there is a SQLite database instance.

This is useful for storing instances of very complex type systems (e.g., ones with circular dependencies) and managing customer tenants.

Keep user information in one database, keep user _data_ in another.

It all depends on your app, really. For some apps the above won't make sense, but maybe it would make sense to keep one db per org, or something like that.

You can also use a different but similar tool e.g. use sqlite to keep user information, use duckdb to keep user data. I find this combo very effective since sqlite is best as fast btree based indexed operations and duckdb is best at multiple row based aggregate computations (get me user X vs get me all users with property Y).
For the one DB file per user question, I'll take a stab at a solution:

* Global store of user identifiers mapped to user IDs. Identifiers can change but the underlying user ID is immutable.

* One DB per user ID.

The per-user DB doesn't really need to know the user identifier, and at auth time you don't need to access the business logic, so cross-DB updates should not be a problem.

In the general case, similar logic applies: you can't partition just anywhere, but where you can partition is pretty well defined.

> Do you maintain a "global" db for fields with unique constraints?

Yes.

> But then when those fields are updated you need to manage this at the application level

Yep. BTW, that's not really so different from a traditional db. You're doing the same thing, just in a different way, and one you probably aren't used to, but it's equally valid.

In a traditional dbms you express this constraint in application code using the database's native DDL, loaded into to the db, and executed by the db server whenever a relevant data change is made.

With sqlite you'd express this constraint in application code using practically any language you want, loaded to your app execution environment, and executed by your app whenever a relevant data change is made.

With a traditional dbms you can get away with a bit less of an application data access layer because the built-in DDL can handle some things. But it can't handle that much, so you really want that application level layer anyway.

> and atomicity can't really be guaranteed

Right. That becomes a concern of the application (could be the data access layer, though, which I think you want whether we're talking about traditional db or sqlite).

As with many horizontal scaling designs we're taking about (1) choosing divisions to minimize the possibilities for inconsistencies; (2) eventual consistency.

In your specific example, I'd question whether that global data -- user name and email -- need to be in the per-user db at all. Links between databases might use uuids that don't change, which limits the tricky cases.

It most definitely does not. Yes, you will lose foreign keys across silos. But this will usually be one association, like a user id for example.

You are also thinking of this defensively, "x value cannot exist here if it doesn't exist in y store". Just think how you'd access this, and you'll realize at worst this creates a void of data which your app likely handles anyway.

If there are numerous associations required across these silos. You don't have the right silos yet.

> Be brave. Just try it.

This is the exact opposite attitude I would want to have when it comes to choosing the best way to store and manage my data. My data is actually important. I understand that might not be true for a lot of other people. I also have a lot of data, and non-trivial amounts of data has incredible inertia. It's very difficult to move and "redo" if you mess it up.

For all these reasons, I just use postgres.

On the flip side, sqlite is also a great choice in those circumstances. so regardless of attitude, outcome seems good to me.

incredibly well tested for scenarios where data loss is likely, code is in the public domain, available on all major operating systems, included in language standard libraries... it's one of the most ubiquitous pieces of software out there.

I'm not going to use a database that doesn't even have a proper date/time type and treats everything as a string.
ok, that seems like a bit of a moved goal post, but you do you.

is there any particular impact you see that having? Postgres has had it's share of data integrity issues due to things like locale differences between primary and replica, where strings would get sorted differently.

I'm not advocating for one over the other. I love both, but neither are perfect.

https://www.citusdata.com/blog/2020/12/12/dont-let-collation...

https://wiki.postgresql.org/wiki/Todo:ICU

Yes, we've used the Litestream + SQLite + Go webserver route in production for several years and while we've had to deal with all of these issues, on the whole I don't have a strong argument for going back and doing it differently.

I'm curious what issues the author had with high availability - I totally understand the issue in theory, but in practice our SQLite-backed service has had 6 minutes of downtime in the last 3 years when we were upgrading the host. That works out to being quite a few 9s.

I do agree with the article on a few points - distributed SQLite doesn't really seem all that worthwhile to me (yet, anyway).

It appears we are essentially stack-buddies (I use those exact technologies too). I have 0 issues with perf, and I love the hermetic unit tests that literally doesn’t even use disk (let alone networking).

However, I find it a bit annoying that I have to restore from backup for one-off queries and data exploration. Do you have some tooling or tricks to recommend?

We have some analytics queries powered by Airflow running on a separate box. We set up Litestream to replicate to that box via SFTP, and then dump changes hourly into Snowflake.
I'm curious what SQlite package do you use? Are you also using cgo or not? cgo has been a bit of pain point for me in the past with SQLite
Yeah cgo has been necessary for me, I forgot why but made that determination long ago. I have many pains with cross platform builds but never had an issue with cgo, and I’m targeting Android, iOS, macOS windows and two Linux flavors. I’m using the mattn/sqlite3 package.
> Removing the network latency is a massive gain, especially for small/medium apps, which is most sites.

No it's not. Network latency is negligible, assuming your database isn't halfway across the world or something.

And as soon as you have to scale to just two webservers, you've either got to implement a client-server network layer on top anyways, or else the complexity of a replication layer.

Just start with Postgres or MySQL in the first place. It's just as performant, just as easy to set up, and can actually scale as you need it. Don't be "brave" when there isn't even any payoff involved.

> Network latency is negligible

Yes but now you also have a network to maintain.

> Just start with Postgres or MySQL in the first place. It's just as performant, just as easy to set up

Yes it's easy to set up. But then you have to maintain it. It's another daemon that needs care.

You don't have a network to maintain. If you have a server, it's already on a network.

And you don't have to do any more maintenance on Postgres/MySQL than you have to do on SQLite. You can leave them running for years without touching them. Maybe you want to upgrade them, but maybe you want to upgrade SQLite too.

> You don't have a network to maintain. If you have a server, it's already on a network.

Well, I'm sure the db server has an IP address? Is it IPV4 or IPV6? How do you receive the IP address? Can it change or not? How do you secure the connection between your web server and the database? Should the database only accept network traffic from your web server or from the wide internet? If the former, you have to keep that in mind when adding a new web server. If the latter, make damn sure the credentials don't leak!

Congratulations, you now maintain a network!

You call that maintaining a network?

I call it five minutes of setup, if even.

And if you're dealing with SQLite, are you making backups somewhere on a network? Securely? With credentials? Where do you keep the credentials for your webserver itself? If you're running a website, you're dealing with networking and credentials from the start.

Yes, it's five minutes to set up now. But it's an extra liability to have to maintain the network in the future, when things inevitably change.
What's difficult about schema migration? Not being snarky, genuinely curious :)
> What's difficult about schema migration? Not being snarky, genuinely curious :)

SQLite's ALTER TABLE support isn't on par with other dbs. That's primarily (perhaps solely) because it stores table definitions as their raw SQL, rather than in some "idealized form." When SQLite first needs to understand a table's structure, it reads the original SQL which was used to create that table. Retroactively modifying arbitrary string input from the user (in this case SQL code) is a minefield.

Edit: see <https://www.sqlite.org/lang_altertable.html> in particular the final section, currently titled "Why ALTER TABLE is such a problem for SQLite"

Interesting, thanks for sharing! This sounds like a nightmare haha

What kinds of problems does this cause in practice? Do you have any examples? Not questioning you, just curious to learn more.

> Do you have any examples?

Not personally, because i've never been affected by it, but it's a topic now and again in the SQLite forum (where i'm a daily visitor).

honest but dumb question(s): how does sqlite handle "connections" from multiple instances of an application? will it constrain to single-instances (or single-machine) running the application?
The historical answer has generally been: it doesn't (at least, not well). The traditional recommendations were not to open multiple connections to the same file, and especially not to share the same file over a network filesystem. This remains the general advice, as far as I can tell, but it may be out of date against some newer features (like WAL mode) and better filesystems (like NFSv4).
You are still correct for network filesystems, but there are no issues opening multiple connections on the same host (even across OCI container boundaries).
There are no data corruption risks in general for multiple processes on the same local filesystem, but the lock obtained by write operations is pretty aggressive. Multiple readers are okay but a single writer will cause transient errors in all other concurrent read and write operations (edit: though you can turn this into just blocking instead with pragma busy_timeout).
> Multiple readers are okay but a single writer will cause transient errors in all other concurrent read and write operations

While in WAL mode, readers and writers don't block one another.

> WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently. https://www.sqlite.org/wal.html

> Multiple readers are okay but a single writer will cause transient errors in all other concurrent read and write operations

They don't cause "transient errors", they cause transaction failures, which you should be handling by retrying the transaction anyway.

If you could retry it, then obviously the error was transient. I don't know what the issue is with me describing it that way.

But you generally won't have to write retry logic if you have the file open in only one place. The other issues that can arise, like constraint violations or no space left on disk, usually aren't transient and thus can't be fixed with retry anyway.

As long as the connections are from the same machine (local filesystem), it's fine.

Reads are concurrent, writes block reads (and other writes) while they write. (There's also something called WAL mode that enables reads during a write, which uses multiple files for each database.)

They don't just get blocked, they error out. This is a well defined error condition, so you can detect it easily, but it's not going to happen if you stick to the one connection rule. This means you have to intentionally think about handling it everywhere if you want to use another connection anywhere.

Edit: looks like you can set up blocking on the client using pragma busy_timeout [1] so the above is only true in default configuration

[1]: https://www.sqlite.org/pragma.html#pragma_busy_timeout

> This means you have to intentionally think about handling it everywhere if you want to use another connection anywhere.

so things like mutexes (one process, multiple threads) or mutexes over shared memory (multiple processes, multiple threads) ?

sounds complicated (and error-prone).

but at the end of the day i guess that kind of juggling has to be done somewhere, either a database server does it for you or you do it yourself in the database client.

i wouldn't mind sqlite on the server, but i'm not sure I would want to trade zero-downtime-deployments (spin up new version, move load from old to new, spin down old version) for single-file ease of operations.

You don't have to really think about it anywhere.

It's handled by the file lock. You don't need to track anything.

Yes there is a timeout value that will return an error if the database file hasn't unlocked in time. In Python the default timeout is 5 seconds, which is plenty if your queries execute on the order of milliseconds or tens of milliseconds.

We do zero-downtime deployments with a single Docker volume containing the db. Spin up a container running the new code, wait til it's healthy, then kill the old container.
You definitely want WAL mode (unless you’re accessing SQLite over NFS). It’s only not the default for backwards compatibility. I’d enable:

PRAGMA journal_mode = WAL; PRAGMA foreign_keys = 1; PRAGMA mmap_size = 1099511627776;

Unless you know a reason that you can’t (like NFS or using a 32-bit cpu).

> Unless you know a reason that you can’t (like NFS or using a 32-bit cpu).

interesting, why there's that limitation when nfs and a 32-bit cpu are involved?

WAL mode requires shared memory, so doesn’t work over the network.

mmap_size = 1TiB requires 1TiB of address space per connection. This is fine on 64-bit, unless you have millions of connections. Doesn’t work with 32-bit (4GiB) address space.

You want mmap_size to be greater than the size of your database. If you know your database will be small then you could set it to a smaller number, but you could still become address space constrained with many connections.

You can have as many connections from as many processes as you want.