by BenoitEssiambre 1420 days ago
I'm glad this is becoming conventional wisdom. I used to argue this in these pages a few years ago and would get downvoted below the posts telling people to split everything into microservices separated by queues (although I suppose it's making me lose my competitive advantage when everyone else is building lean and mean infrastructure too).

In my mind, reasons involve keeping transactional integrity, ACID compliance, better error propagation, avoiding the hundreds of impossible to solve roadblocks of distributed systems (https://groups.csail.mit.edu/tds/papers/Lynch/MIT-LCS-TM-394...).

But also it is about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ), covering distance over network wires involves hard latency constraints, not to mention dealing with congestion over these wires.
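To put rough numbers on the latency point, here is a back-of-the-envelope sketch. It assumes signals in optical fiber travel at about two thirds of the speed of light in vacuum (a common rule of thumb, not a figure from the talk) and ignores routing, queuing and congestion entirely. The function names are made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
FIBER_VELOCITY_FACTOR = 2 / 3         # assumption: rough rule of thumb for optical fiber


def metres_per_nanosecond() -> float:
    """Distance light covers in vacuum in one nanosecond."""
    return SPEED_OF_LIGHT_M_PER_S * 1e-9


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring every other delay."""
    one_way_s = (distance_km * 1000) / (SPEED_OF_LIGHT_M_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    print(f"{metres_per_nanosecond():.3f} m per nanosecond")
    for km in (1, 100, 1000, 10_000):
        print(f"{km:>6} km: at least {min_round_trip_ms(km):.3f} ms round trip")
```

A main-memory reference is usually quoted at roughly 100 nanoseconds, so even the physical floor for a 1000 km round trip is several orders of magnitude slower than staying on one machine.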

Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of L1, L2, L3, and RAM caches than distributed systems, for speedups often on the order of 100X to 1000X.

Sure it's easier to throw more hardware at the problem with distributed systems but the downsides are significant so be sure you really need it.

Now there is a corollary to using monoliths. Since you only have one db, that db should be treated as somewhat sacred: you want to avoid wasting resources inside it. This means being a bit more careful about how you are storing things, using the smallest data structures, normalizing when you can, etc. This is not to save disk; disk is cheap. This is to make efficient use of L1, L2, L3 and RAM.

I've seen boolean true or false values saved as large JSON documents: {"usersetting1": true, "usersetting2": false, "setting1name": "name", etc.}, with 10 bits of data ending up as a 1k JSON document. Avoid this! Storing documents means the keys, in effect the full table schema, are repeated in every row. It has its uses, but if you can predefine your schema and use the smallest types needed, you gain a lot of performance, mostly through much higher cache efficiency!
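A minimal sketch of the size difference, assuming ten hypothetical boolean settings named `usersetting1` to `usersetting10`. It compares a JSON document that repeats every key with the same flags packed into a 2-byte integer. A real schema would more likely use typed boolean or small integer columns; the bit packing here is only meant to show how little information is actually being stored.

```python
import json
import struct

# Hypothetical settings: ten boolean flags, as in the "10 bits of data" example.
SETTING_NAMES = [f"usersetting{i}" for i in range(1, 11)]


def as_json_document(flags: list[bool]) -> bytes:
    """Store every key next to every value, as a document column would."""
    return json.dumps(dict(zip(SETTING_NAMES, flags))).encode("utf-8")


def as_packed_bits(flags: list[bool]) -> bytes:
    """Pack the same flags into one 2-byte unsigned integer (bit i = flag i)."""
    value = 0
    for i, flag in enumerate(flags):
        if flag:
            value |= 1 << i
    return struct.pack("<H", value)


def unpack_bits(blob: bytes) -> list[bool]:
    """Inverse of as_packed_bits."""
    (value,) = struct.unpack("<H", blob)
    return [bool((value >> i) & 1) for i in range(len(SETTING_NAMES))]


if __name__ == "__main__":
    flags = [True, False] * 5
    doc = as_json_document(flags)
    packed = as_packed_bits(flags)
    print(len(doc), "bytes as JSON vs", len(packed), "bytes packed")
```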


> I'm glad this is becoming conventional wisdom

It's not though. You're just seeing the most popular opinion on HN.

In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.

> outsource the problem to some vendor

At least that way you can be certain of failure.

Exactly. The HN crowd is obsessed with minimalism and reducing "bloat".

It has become a cult, where availability and scale requirements are apparently fiction. "You are not FAANG, you don't have these requirements."

> I'm glad this is becoming conventional wisdom

My hunch is that computers caught up. Back in the early 2000's horizontal scaling was the only way. You simply couldn't handle even reasonably mediocre loads on a single machine.

As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.

I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.

Honestly from my perspective it feels like microservices arose strongly in popularity precisely when it was becoming less necessary. In particular the mass adoption of SSD storage massively changed the nature of the game, but awareness of that among regular developers seemed not as pervasive as it should have been.

'over the wire' is less obvious than it used to be.

If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process switching where you could be just making a method call, but we had to do something.

I'm seeing less 'balls of mud' with microservices. That's not zero balls of mud. But it's not a given for almost every code base I wander into.

To clarify, I think stateless microservices are good. It's when you have too many DBs (and sometimes too many queues) that you run into problems.

A single instance of PostgreSQL is, in most situations, almost miraculously effective at coordinating concurrent and parallel state mutations. To me that's one of the most important characteristics of an RDBMS. Storing data is a simpler secondary problem. Managing concurrency is the hard problem that I need most help with from my DB, and having a monolithic DB enables the coordination of everything else, including stateless peripheral services, without resulting in race conditions, conflicts or data corruption.

SQL is the most popular mostly functional language. This might be because managing persistent state and keeping data organized and low entropy is where you get the most benefit from using a functional approach that doesn't add more state. This adds to the effectiveness of using a single transactional DB.

I must admit that even distributed DBs, like Cockroach and Yugabyte have recognized this and use the PostgreSQL syntax and protocol. This is good though, it means that if you really need to scale beyond PostgreSQL, you have PostgreSQL compatible options.

> I'm seeing less 'balls of mud' with microservices.

The parallel to "balls of mud" with microservices is tiny services that seem almost devoid of any business logic and all the actual business logic is encapsulated in the calls between different services, lambda functions, and so on.

That's quite nightmarish from a maintenance perspective too, because now it's almost impossible to look at the system from the outside and understand what it's doing. It also means that conventional tooling can't help you anymore: you don't get compiler errors if your lambda function calls an endpoint that doesn't exist anymore.

Big balls of mud are horrible (I'm currently working with a big ball of mud monolith, I know what I'm talking about), but you can create a different kind of mess with microservices too. Then there are all the other problems, such as operational complexity, or "I now need to update log4j across 30 services".

In the end, a well-engineered system needs discipline and architectural skills, as well as a healthy engineering culture where tech debt can be paid off, regardless of whether it's a monolith, a microservice architecture or something in between.

> I'm seeing less 'balls of mud' with microservices. That's not zero balls of mud.

They are probably younger. Give them time :P

>"I'm glad this is becoming conventional wisdom. "

Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.

I've never understood this logic for webapps. If you're building a web application, congratulations, you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you end up paying all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, being careful about how much data you're storing, and getting zilch for it, because you end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).

Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.

Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).

I really don't understand how anything of what you wrote follows from the fact that you're building a web-app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.

And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.
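A minimal sketch of that kind of copy command, using Python's built-in `sqlite3` module as a stand-in for whatever RDBMS sits behind the request handler. The `threads` and `posts` tables, the `copy_thread` function and the `fail_midway` switch are all hypothetical and only exist to show the all-or-nothing behaviour.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with one thread and two posts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            thread_id INTEGER NOT NULL REFERENCES threads(id),
            body TEXT NOT NULL
        );
        INSERT INTO threads (id, title) VALUES (1, 'original');
        INSERT INTO posts (thread_id, body) VALUES (1, 'first'), (1, 'second');
    """)
    return conn


def copy_thread(conn: sqlite3.Connection, thread_id: int, new_title: str,
                fail_midway: bool = False) -> int:
    """Copy a thread and its posts; either both tables change or neither does."""
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        cur = conn.execute("INSERT INTO threads (title) VALUES (?)", (new_title,))
        new_id = cur.lastrowid
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "INSERT INTO posts (thread_id, body) "
            "SELECT ?, body FROM posts WHERE thread_id = ?",
            (new_id, thread_id),
        )
    return new_id


if __name__ == "__main__":
    conn = make_db()
    try:
        copy_thread(conn, 1, "copy", fail_midway=True)
    except RuntimeError:
        pass
    print(conn.execute("SELECT COUNT(*) FROM threads").fetchone()[0], "thread(s) after failed copy")
```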

I couldn't disagree more with your conclusion even if I wanted to. Relational databases are great. We should use more of them.

> I really don't understand how anything of what you wrote follows from the fact that you're building a web-app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.

> And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.

Sure, if you can represent what the user wants to do as a "command" like that, that doesn't rely on a particular state of the world, then you're fine. Note that this is also exactly the case that an eventually consistent event-sourcing style system will handle fine.

The case where transactions would actually be useful is the case where a user wants to read something and modify something based on what they read. But you can't possibly do that over the web, because they read the data in one request and write it in another request that may never come. If two people try to edit the same wiki page at the same time, either one of them loses their data, or you implement some kind of "userspace" reconciliation logic - but database transactions can't help you with that. If one user tries to make a new post in a forum thread at the same time as another user deletes that thread, probably they get an error that throws away all their data, because storing it would break referential integrity.

> Sure, if you can represent what the user wants to do as a "command" like that, that doesn't rely on a particular state of the world, then you're fine. Note that this is also exactly the case that an eventually consistent event-sourcing style system will handle fine.

Yes, but the event-sourcing system (or similar variants, such as CRDTs) is much more complex. It's true that it buys you some things (like the ability to roll back to specific versions), but you have to ask yourself whether you really need that for a specific piece of data.

(And even if you use event sourcing, if you have many events, you probably won't want to replay all of them, so you'll maybe want to store the result in a database, in which case you can choose a relational one.)

> If two people try to edit the same wiki page at the same time, either one of them loses their data, or you implement some kind of "userspace" reconciliation logic - but database transactions can't help you with that.

Yes, but

a) that's simply not a problem in all situations. People will generally not update their user profile concurrently with other users, for example. So it only applies to situations where data is truly shared across multiple users, and it doesn't make sense to build a complex system only for these use cases,

b) the problem of users overwriting other users' data is inherent to the problem domain; you will, in the end, have to decide which version is the most recent regardless of which technology you use. The one thing that events etc. buy you is a version history (which btw can also be implemented with a RDBMS), but if you want to expose that in the UI so the user can go back, you have to do additional work anyway - it doesn't come for free.

c) Meanwhile, the RDBMS will at least guarantee that the data is always in a consistent state. Users overwriting other users' data is unfortunate, but corrupted data is worse.

d) You can solve the "concurrent modification" issue in a variety of ways, depending on the frequency of the problem, without having to implement a complex event-sourced system. For example, a lock mechanism is fairly easy to implement and useful in many cases. You could also, for example, hash the contents of what the user is seeing and reject the change if there is a mismatch with the current state (I've never tried it, but it should work in theory).
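The hash idea in (d) can be sketched as a compare-and-swap in a single UPDATE statement. This is an illustration only, using Python's built-in `sqlite3` module; the `pages` table, its `body_hash` column and the function names are hypothetical. The client is assumed to send back the hash it received when it loaded the page.

```python
import hashlib
import sqlite3


def digest(text: str) -> str:
    """Fingerprint of the content a user was looking at."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT NOT NULL, body_hash TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO pages VALUES (1, ?, ?)", ("v1", digest("v1")))
    conn.commit()
    return conn


def read_page(conn: sqlite3.Connection, page_id: int) -> tuple[str, str]:
    """Return (body, body_hash); the hash travels to the client with the page."""
    return conn.execute(
        "SELECT body, body_hash FROM pages WHERE id = ?", (page_id,)
    ).fetchone()


def save_page(conn: sqlite3.Connection, page_id: int, new_body: str, seen_hash: str) -> bool:
    """Write only if the stored content is still what the user saw."""
    with conn:
        cur = conn.execute(
            "UPDATE pages SET body = ?, body_hash = ? WHERE id = ? AND body_hash = ?",
            (new_body, digest(new_body), page_id, seen_hash),
        )
    # Zero rows updated means someone else saved first; the caller can show a conflict.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = make_db()
    _, hash_seen_by_alice = read_page(conn, 1)
    _, hash_seen_by_bob = read_page(conn, 1)
    print("alice saved:", save_page(conn, 1, "alice's edit", hash_seen_by_alice))
    print("bob saved:", save_page(conn, 1, "bob's edit", hash_seen_by_bob))
```

Because the check and the write happen in one statement, the database decides the winner; the rejected edit is still in the second user's form and can be merged or resubmitted.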

I don't wish to claim that a relational database solves all transactionality (and consistency) problems, but they certainly solve some of them - so throwing them out because of that is a bit like "tests don't find all bugs, so we don't write them anymore".

> Yes, but the event-sourcing system (or similar variants, such as CRDTs) is much more complex.

It's really not. An RDBMS usually contains all of the same stuff underneath the hood (MVCC etc.), it just tries to paper over it and present the illusion of a single consistent state of the world, and unfortunately that ends up being leaky.

> a) that's simply not a problem in all situations. People will generally not update their user profile concurrently with other users, for example. So it only applies to situations where data is truly shared across multiple users,

Sure - but those situations are ipso facto situations where you have no need for transactions.

> b) the problem of users overwriting other users' data is inherent to the problem domain; you will, in the end, have to decide which version is the most recent regardless of which technology you use. The one thing that events etc. buy you is a version history (which btw can also be implemented with a RDBMS), but if you want to expose that in the UI so the user can go back, you have to do additional work anyway - it doesn't come for free.

True, but what does come for free is thinking about it when you're designing your dataflow. Using an event sourcing style forces you to confront the idea that you're going to have concurrent updates going on, early enough in the process that you naturally design your data model to handle it, rather than imagining that you can always see "the" current state of the world.

> c) Meanwhile, the RDBMS will at least guarantee that the data is always in a consistent state. Users overwriting other users' data is unfortunate, but corrupted data is worse.

I'm not convinced, because the way it accomplishes that is by dropping "corrupt" data on the floor. If user A tries to save new post B in thread C, but at the same time user D has deleted that thread, then in a RDBMS where you're using a foreign key the only thing you can do is error and never save the content of post B. In an event sourcing system you still have to deal with the fact that the post belongs in a nonexistent thread eventually, but you don't start by losing the user's data, and it's very natural to do something like mark it as an orphaned post that the user can still see in their own post history, which is probably what you want. (Of course you can achieve that in the RDBMS approach, but it tends to involve more complex logic, giving up on foreign keys and accepting that you have to solve the same data integrity problems as a non-ACID system, or both).

> d) You can solve the "concurrent modification" issue in a variety of ways, depending on the frequency of the problem, without having to implement a complex event-sourced system. For example, a lock mechanism is fairly easy to implement and useful in many cases. You could also, for example, hash the contents of what the user is seeing and reject the change if there is a mismatch with the current state (I've never tried it, but it should work in theory).

That sounds a whole lot more complex than just sticking it in an event sourcing system. Especially when the problem is rare, it's much better to find a solution where the correct behaviour naturally arises in that case, than implement some kind of ad-hoc special case workaround that will never be tested as rigorously as your "happy path" case.

> It's really not. An RDBMS usually contains all of the same stuff underneath the hood (MVCC etc.), it just tries to paper over it and present the illusion of a single consistent state of the world, and unfortunately that ends up being leaky.

There's nothing leaky about it. Relational algebra is a well-understood mathematical abstraction. Meanwhile, I can just set up postgres and an ORM (or something more lightweight, if I prefer) and I'm good to go - there's thousands of examples of how to do that. Event-sourced architectures have decidedly more pitfalls. If my event handling isn't commutative, associative and idempotent I'm either losing out on concurrency benefits (because I'm asking my queue to synchronise messages) or I'll get undefined behaviour.
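To make those three properties concrete, here is a small sketch using a grow-only counter, a standard CRDT whose merge is an element-wise maximum. The node names and the naive `apply_increment` handler are hypothetical and only serve as a contrast: the merge can be applied in any order and any number of times, while the naive handler double counts a redelivered event.

```python
Counter = dict[str, int]  # node id -> highest count observed from that node


def merge(a: Counter, b: Counter) -> Counter:
    """Element-wise max: commutative, associative and idempotent."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(counter: Counter) -> int:
    """Total across all nodes."""
    return sum(counter.values())


def apply_increment(total: int, amount: int) -> int:
    """Naive handler: not idempotent, so a redelivered event is counted twice."""
    return total + amount


if __name__ == "__main__":
    a = {"node-a": 3}
    b = {"node-a": 1, "node-b": 2}
    print(merge(a, b) == merge(b, a))            # order does not matter
    print(merge(a, merge(a, b)) == merge(a, b))  # replaying a state changes nothing
    print(apply_increment(apply_increment(0, 5), 5))  # the same event delivered twice
```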

There's really probably no scenario in which implementing a CRUD app with a relational database isn't going to take significantly less time than some event sourced architecture.

> Sure - but those situations are ipso facto situations where you have no need for transactions.

> Using an event sourcing style forces you to confront the idea that you're going to have concurrent updates going on

There are tons of examples like backoffice tools (where people might work in shifts or on different data sets), delivery services, language learning apps, flashcard apps, government forms, todo list and note taking apps, price comparison services, fitness trackers, banking apps, and so on, where some or even most of the data is not usually concurrently edited by multiple users, but where you will still probably want consistency guarantees across multiple tables.

Yes, if you're building Twitter, by all means use event sourcing or CRDTs or something. But we're not all building Twitter.

> If user A tries to save new post B in thread C, but at the same time user D has deleted that thread, then in a RDBMS where you're using a foreign key the only thing you can do is error and never save the content of post B.

I don't think I've ever seen a forum app that doesn't just "throw away" the user comment in such a case, in the sense that it will not be stored in the database. Sure, you might have some event somewhere, but how is that going to help the user? Should they write a nice email and hope that some engineer with too much time is going to find that event somewhere buried deep in the production infrastructure and then ... do what exactly with it?

This is a solution in search of a problem. Instead, you should design your UI such that the comment field is not cleared upon a failed submission, like any reasonable forum software. Then the user who really wants to save their ramblings can still do so, without the need of any complicated event-sourcing mechanism. And in most forums, threads are rarely deleted, only locked (unless it's outright spam/illegal content/etc.)

(Also, there are a lot of different ways how things can be designed when you're using an RDBMS. You can also implement soft deletes (which many applications do) and then you won't get any foreign key errors. In that way, you can still display "orphaned" comments that belong to deleted threads, if you so wish (have never seen a forum do that, though). Recovering a soft deleted thread is probably also an order of magnitude easier than trying to replay it from some events. Yes, soft deletes involve other tradeoffs - but so does every architecture choice.)
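A minimal sketch of the soft-delete variant, again with Python's built-in `sqlite3` module and a hypothetical `threads`/`posts` schema. The foreign key stays in place; deleting a thread only sets `deleted_at`, so a post that arrives late still has a row to point at and can be listed as orphaned.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.executescript("""
        CREATE TABLE threads (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            deleted_at TEXT  -- NULL means the thread is live
        );
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            thread_id INTEGER NOT NULL REFERENCES threads(id),
            body TEXT NOT NULL
        );
        INSERT INTO threads (id, title) VALUES (1, 'hello');
    """)
    return conn


def soft_delete_thread(conn: sqlite3.Connection, thread_id: int) -> None:
    """Mark the thread as deleted instead of removing the row."""
    with conn:
        conn.execute(
            "UPDATE threads SET deleted_at = datetime('now') WHERE id = ?", (thread_id,)
        )


def add_post(conn: sqlite3.Connection, thread_id: int, body: str) -> None:
    """Raises sqlite3.IntegrityError only if the thread row does not exist at all."""
    with conn:
        conn.execute("INSERT INTO posts (thread_id, body) VALUES (?, ?)", (thread_id, body))


def visible_threads(conn: sqlite3.Connection) -> list[str]:
    rows = conn.execute("SELECT title FROM threads WHERE deleted_at IS NULL")
    return [title for (title,) in rows]


def orphaned_posts(conn: sqlite3.Connection) -> list[str]:
    """Posts whose thread has been soft deleted."""
    rows = conn.execute(
        "SELECT p.body FROM posts p JOIN threads t ON t.id = p.thread_id "
        "WHERE t.deleted_at IS NOT NULL ORDER BY p.id"
    )
    return [body for (body,) in rows]
```

Whether a late post into a deleted thread should be accepted at all is a product decision; the sketch only shows that the schema does not force the data to be dropped.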

> That sounds a whole lot more complex than just sticking it in an event sourcing system. Especially when the problem is rare, it's much better to find a solution where the correct behaviour naturally arises in that case.

I really disagree that a locking mechanism is more difficult than an event sourced system. The mechanism doesn't have to be perfect. If a user loses the lock because they haven't done anything in half an hour, then in many cases that's completely acceptable. Such a system is not hard to implement (I could just use a redis store with expiring entries) and it will also be much easier to understand, since you now don't have to track the flow of your business logic across multiple services.
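A rough sketch of such an expiring lock, kept in memory so it runs on its own. The `ExpiringLocks` class and its method names are hypothetical, and the code is not thread-safe. With Redis, the same idea is usually expressed with the `SET` command's `NX` and `EX` options; that mapping is an assumption about how one would deploy it, not something the sketch exercises.

```python
import time
from typing import Callable


class ExpiringLocks:
    """Edit locks that lapse on their own after ttl_seconds of inactivity."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not have to sleep
        self._held: dict[str, tuple[str, float]] = {}  # resource -> (owner, expires_at)

    def acquire(self, resource: str, owner: str) -> bool:
        """Take or refresh the lock; fails only while another owner's lock is live."""
        now = self._clock()
        holder = self._held.get(resource)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False
        self._held[resource] = (owner, now + self._ttl)
        return True

    def release(self, resource: str, owner: str) -> None:
        """Drop the lock if this owner still holds it."""
        holder = self._held.get(resource)
        if holder is not None and holder[0] == owner:
            del self._held[resource]


if __name__ == "__main__":
    locks = ExpiringLocks(ttl_seconds=1800)  # half an hour, as in the comment above
    print(locks.acquire("page:1", "alice"))
    print(locks.acquire("page:1", "bob"))
```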

I also don't know why you think that your event-sourced system will be better tested. Are you going to test for the network being unreliable, messages getting lost or being delivered out of order, and so on? If so, you can also afford to properly test a locking mechanism (which can be readily done in a monolith, maybe with an additional redis dependency, and is therefore more easily testable than some event-based logic that spans multiple services).

And in engineering, there are rarely "natural" solutions to problems. There are specific problems and they require specific solutions. Distributed systems, event sourcing etc. are great where they're called for. In many cases, they're simply not.

HTTP requests work great with relational DBs. This is not UDP. If the TCP connection is broken, an operation will either have finished or have stopped and rolled back atomically, and unless you've placed unneeded queues in there, you should know of success immediately.

When you get the HTTP response, you will know the data is fully committed; data that uses it can be refreshed immediately and is accessible to all other systems immediately, so you can perform next steps relying on those hard guarantees. Behind the HTTP request, a transaction can be opened to do a bunch of stuff, including API calls to other systems if needed, and commit the results as an atomic transaction. There are tons of benefits to using it with HTTP.

But you can't do interaction between the two ends of an HTTP request. The caller makes an inert request; whatever processing happens downstream of that might as well be offline, because it's not and can never be interactive within a single transaction.

Now you're shifting the goalposts. You started out by claiming that web apps can't be transactional, now you've switched to saying they can't be transactional if they're "interactive" (by which you presumably mean transactions that span multiple HTTP requests).

Of course, that's a very particular demand, one that doesn't necessarily apply to many applications.

And even then, depending on the use case, there are relatively straightforward ways of implementing that too: For example, if you build up all the data on the client (potentially by querying the server, with some of the partial data, for the next form page, or whatever) and submit it all in one single final request.

>As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ), covering distance over network wires involves hard latency constraints, not to mention dealing with congestion over these wires.

Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.