| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lmm 1423 days ago

I've never understood this logic for webapps. If you're building a web application, congratulations, you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you end up paying all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, being careful about how much data you're storing, and getting zilch for it, because you end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).

Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.

Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).

2 comments

Tainnor 1423 days ago

I really don't understand how anything of what you wrote follows from the fact that you're building a web-app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.

And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.

I couldn't disagree more with your conclusion even if I wanted to. Relational databases are great. We should use more of them.

link

lmm 1423 days ago

> I really don't understand how anything of what you wrote follows from the fact that you're building a web-app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.

> And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.

Sure, if you can represent what the user wants to do as a "command" like that, that doesn't rely on a particular state of the world, then you're fine. Note that this is also exactly the case that an eventually consistent event-sourcing style system will handle fine.

The case where transactions would actually be useful is the case where a user wants to read something and modify something based on what they read. But you can't possibly do that over the web, because they read the data in one request and write it in another request that may never come. If two people try to edit the same wiki page at the same time, either one of them loses their data, or you implement some kind of "userspace" reconciliation logic - but database transactions can't help you with that. If one user tries to make a new post in a forum thread at the same time as another user deletes that thread, probably they get an error that throws away all their data, because storing it would break referential integrity.

link

Tainnor 1423 days ago

> Sure, if you can represent what the user wants to do as a "command" like that, that doesn't rely on a particular state of the world, then you're fine. Note that this is also exactly the case that an eventually consistent event-sourcing style system will handle fine.

Yes, but the event-sourcing system (or similar variants, such as CRDTs) is much more complex. It's true that it buys you some things (like the ability to roll back to specific versions), but you have to ask yourself whether you really need that for a specific piece of data.

(And even if you use event sourcing, if you have many events, you probably won't want to replay all of them, so you'll maybe want to store the result in a database, in which case you can choose a relational one.)

> If two people try to edit the same wiki page at the same time, either one of them loses their data, or you implement some kind of "userspace" reconciliation logic - but database transactions can't help you with that.

Yes, but

a) that's simply not a problem in all situations. People will generally not update their user profile concurrently with other users, for example. So it only applies to situations where data is truly shared across multiple users, and it doesn't make sense to build a complex system only for these use cases,

b) the problem of users overwriting other users' data is inherent to the problem domain; you will, in the end, have to decide which version is the most recent regardless of which technology you use. The one thing that evens etc. buy you is a version history (which btw can also be implemented with a RDBMS), but if you want to expose that in the UI so the user can go back, you have to do additional work anyway - it doesn't come for free.

c) Meanwhile, the RDBMS will at least guarantee that the data is always in a consistent state. Users overwriting other users' data is unfortunate, but corrupted data is worse.

d) You can solve the "concurrent modification" issue in a variety of ways, depending on the frequency of the problem, without having to implement a complex event-sourced system. For example, a lock mechanism is fairly easy to implement and useful in many cases. You could also, for example, hash the contents of what the user is seeing and reject the change if there is a mismatch with the current state (I've never tried it, but it should work in theory).

I don't wish to claim that a relational database solves all transactionality (and consistency) problems, but they certainly solve some of them - so throwing them out because of that is a bit like "tests don't find all bugs, so we don't write them anymore".

link

lmm 1423 days ago

> Yes, but the event-sourcing system (or similar variants, such as CRDTs) is much more complex.

It's really not. An RDBMS usually contains all of the same stuff underneath the hood (MVCC etc.), it just tries to paper over it and present the illusion of a single consistent state of the world, and unfortunately that ends up being leaky.

> a) that's simply not a problem in all situations. People will generally not update their user profile concurrently with other users, for example. So it only applies to situations where data is truly shared across multiple users,

Sure - but those situations are ipso facto situations where you have no need for transactions.

> b) the problem of users overwriting other users' data is inherent to the problem domain; you will, in the end, have to decide which version is the most recent regardless of which technology you use. The one thing that evens etc. buy you is a version history (which btw can also be implemented with a RDBMS), but if you want to expose that in the UI so the user can go back, you have to do additional work anyway - it doesn't come for free.

True, but what does come for free is thinking about it when you're designing your dataflow. Using an event sourcing style forces you to confront the idea that you're going to have concurrent updates going on, early enough in the process that you naturally design your data model to handle it, rather than imagining that you can always see "the" current state of the world.

> c) Meanwhile, the RDBMS will at least guarantee that the data is always in a consistent state. Users overwriting other users' data is unfortunate, but corrupted data is worse.

I'm not convinced, because the way it accomplishes that is by dropping "corrupt" data on the floor. If user A tries to save new post B in thread C, but at the same time user D has deleted that thread, then in a RDBMS where you're using a foreign key the only thing you can do is error and never save the content of post B. In an event sourcing system you still have to deal with the fact that the post belongs in a nonexistent thread eventually, but you don't start by losing the user's data, and it's very natural to do something like mark it as an orphaned post that the user can still see in their own post history, which is probably what you want. (Of course you can achieve that in the RDBMS approach, but it tends to involve more complex logic, giving up on foreign keys and accepting tha you have to solve the same data integrity problems as a non-ACID system, or both).

> d) You can solve the "concurrent modification" issue in a variety of ways, depending on the frequency of the problem, without having to implement a complex event-sourced system. For example, a lock mechanism is fairly easy to implement and useful in many cases. You could also, for example, hash the contents of what the user is seeing and reject the change if there is a mismatch with the current state (I've never tried it, but it should work in theory).

That sounds a whole lot more complex than just sticking it an event sourcing system. Especially when the problem is rare, it's much better to find a solution where the correct behaviour naturally arises in that case, than implement some kind of ad-hoc special case workaround that will never be tested as rigorously as your "happy path" case.

link

Tainnor 1423 days ago

> It's really not. An RDBMS usually contains all of the same stuff underneath the hood (MVCC etc.), it just tries to paper over it and present the illusion of a single consistent state of the world, and unfortunately that ends up being leaky.

There's nothing leaky about it. Relational algebra is a well-understood mathematical abstraction. Meanwhile, I can just set up postgres and an ORM (or something more lightweight, if I prefer) and I'm good to go - there's thousands of examples of how to do that. Event-sourced architectures have decidedly more pitfalls. If my event handling isn't commutative, associative and idempotent I'm either losing out on concurrency benefits (because I'm asking my queue to synchronise messages) or I'll get undefined behaviour.

There's really probably no scenario in which implementing a CRUD app with a relational database isn't going to take significantly less time than some event sourced architecture.

> Sure - but those situations are ipso facto situations where you have no need for transactions.

> Using an event sourcing style forces you to confront the idea that you're going to have concurrent updates going on

There are tons of examples like backoffice tools (where people might work in shifts or on different data sets), delivery services, language learning apps, flashcard apps, government forms, todo list and note taking apps, price comparison services, fitness trackers, banking apps, and so on, where some or even most of the data is not usually concurrently edited by multiple users, but where you still will probably have consistency guarantees across multiple tables.

Yes, if you're building Twitter, by all means use event sourcing or CRDTs or something. But we're not all building Twitter.

> If user A tries to save new post B in thread C, but at the same time user D has deleted that thread, then in a RDBMS where you're using a foreign key the only thing you can do is error and never save the content of post B.

I don't think I've ever seen a forum app that doesn't just "throw away" the user comment in such a case, in the sense that it will not be stored in the database. Sure, you might have some event somewhere, but how is that going to help the user? Should they write a nice email and hope that some engineer with too much time is going to find that event somewhere buried deep in the production infrastructure and then ... do what exactly with it?

This is a solution in search of a problem. Instead, you should design your UI such that the comment field is not cleared upon a failed submission, like any reasonable forum software. Then the user who really wants to save their ramblings can still do so, without the need of any complicated event-sourcing mechanism. And in most forums, threads are rarely deleted, only locked (unless it's outright spam/illegal content/etc.)

(Also, there are a lot of different ways how things can be designed when you're using an RDBMS. You can also implement soft deletes (which many applications do) and then you won't get any foreign key errors. In that way, you can still display "orphaned" comments that belong to deleted threads, if you so wish (have never seen a forum do that, though). Recovering a soft deleted thread is probably also an order of magnitude easier than trying to replay it from some events. Yes, soft deletes involve other tradeoffs - but so does every architecture choice.)

> That sounds a whole lot more complex than just sticking it an event sourcing system. Especially when the problem is rare, it's much better to find a solution where the correct behaviour naturally arises in that case.

I really disagree that a locking mechanism is more difficult than an event sourced system. The mechanism doesn't have to be perfect. If a user loses the lock because they haven't done anything in half an hour, then in many cases that's completely acceptable. Such a system is not hard to implement (I could just use a redis store with expiring entries) and it will also be much easier to understand, since you now don't have to track the flow of your business logic across multiple services.

I also don't know why you think that your event-sourced system will be better tested. Are you going to test for the network being unreliable, messages getting lost or being delivered out of order, and so on? If so, you can also afford to properly test a locking mechanism (which can be readily done in a monolith, maybe with an additional redis dependency, and is therefore more easily testable than some event-based logic that spans multiple services).

And in engineering, there are rarely "natural" solutions to problems. There are specific problems and they require specific solutions. Distributed systems, event sourcing etc. are great where they're called for. In many cases, they're simply not.

link

BenoitEssiambre 1423 days ago

Http requests work great with relational dbs. This is not UDP. If the TCP connection is broken, an operation will either have finished or stopped and rolledback atomically and unless you've placed unneeded queues in there, you should know of success immediately.

When you get the http response, you will know the data is fully committed, data that uses it can be refreshed immediately and is accessible to all other systems immediately so you can perform next steps relying on those hard guarantees. Behind the http request, a transaction can be opened to do a bunch of stuff including API calls to other systems if needed and commit the results as an atomic transaction. There are tons of benefit using it with http.

link

lmm 1423 days ago

But you can't do interaction between the two ends of a HTTP request. The caller makes an inert request, whatever processing happens downstream of that might as well be offline because it's not and can never be interactive within a single transaction.

link

Tainnor 1422 days ago

Now you're shifting the goalposts. You started out by claiming that web apps can't be transactional, now you've switched to saying they can't be transactional if they're "interactive" (by which you presumably mean transactions that span multiple HTTP requests).

Of course, that's a very particular demand, one that doesn't necessarily apply to many applications.

And even then, depending on the use case, there are relatively straightforward ways of implementing that too: For example, if you build up all the data on the client (potentially by querying the server, with some of the partial data, for the next form page, or whatever) and submit it all in one single final request.

link