by natdempk 625 days ago
Pretty great advice!

The one hard thing you can run into: once you want to support datasets that fall outside the scope of a transaction (think events/search/derived data, anything that needs to read from or write to a system that is not your primary transactional DB), you probably do want some sort of event bus or queue to get eventual consistency across all of them. Otherwise you end up in impossible situations when you try to manage things like a DB write + ES document update: sooner or later one of the writes fails, your state is desynced across datastores, and you're in velocity/bug hell. The other side of this, though, is that once you introduce the event bus and a transactional outbox or whatever, writes and updates are no longer reflected immediately. I think the best solutions to this problem are things like Meta's TAO that combine these concepts, but I have no idea what is available to mere mortals/startups. Would love to know if anyone has killer recommendations here.
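For anyone who hasn't seen the transactional-outbox idea mentioned above, here is a minimal sketch in Python with SQLite. The table names (`records`, `outbox`), the function names, and the plain dict standing in for the search index are all assumptions made for illustration; a real setup would use your own schema and a real ES client.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one business table plus an outbox table.
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )


def save_record(conn, record_id, title):
    # The business row and the outbox row commit together or not at all,
    # so there is never a DB write without a matching pending event.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO records (id, title) VALUES (?, ?)",
            (record_id, title),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"id": record_id, "title": title}),),
        )


def drain_outbox(conn, index):
    # Background worker step: apply pending events to the index in order.
    # `index` is a dict standing in for the search engine (an assumption).
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for outbox_id, payload in rows:
        doc = json.loads(payload)
        index[doc["id"]] = doc  # idempotent upsert, safe to repeat after a crash
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (outbox_id,))
    return len(rows)
```

Note that the index stays stale until `drain_outbox` runs, which is exactly the "not reflected immediately" trade-off described above.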


I think the question is if you need the entire system to be strongly consistent, or just the core of it?

To use ElasticSearch as an example: do you need to add the complexity of keeping the index up to date in realtime, or can you live with periodic updates for search or a background job for it?

As long as your primary DB is the source of truth, you can use that to bring other less critical stores up to date outside of the context of an API request.
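A rough sketch of what such a background catch-up job could look like. It assumes (hypothetically) that every row in a `records` table carries a monotonically increasing `version` column that the application bumps on each write, and it uses a plain dict in place of the secondary store.

```python
import sqlite3


def sync_index(conn, index, last_version):
    """Copy rows changed since `last_version` into the index.

    Returns the new watermark to persist for the next run.
    `records.version` is an assumed change counter, not a built-in feature.
    """
    rows = conn.execute(
        "SELECT id, title, version FROM records WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    for record_id, title, version in rows:
        index[record_id] = {"id": record_id, "title": title}
        last_version = version
    return last_version
```

You would run this from a cron job or worker loop rather than inside an API request. Deletes are not covered here; they would need tombstone rows or a separate pass.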

Well, the problem you run into is that you kind of want different datastores for different use cases, for example search vs. specific page loads. You want both to be consistent, but you don't have a single DB that can serve both use cases (often it's primary DB + ElasticSearch). If you don't keep them consistent, you have user-facing bugs where a user can update a record but not search for it immediately. If you instead load everything from ES to give the user a consistent view, updates can disappear on refresh. And if you write to both SQL + ES in an API request, they can desync when one of the writes fails. The problem is less the complexity of keeping the index up to date in realtime, and more that the ES index isn't even consistent with the primary DB, so to a user they are just different parts of your app that seem a little broken in subtle, inconsistent ways. It would be great to have everything present a consistent view to users that updates together on write.
The way I solved it once was to try updating ES synchronously and, if that failed or timed out, queue an event to index the doc. The timeout case wasn't an issue, because a double update wasn't harmful.
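Roughly, that approach could be sketched like this. The names `index_or_enqueue`, `drain_retries`, and the `index_doc` callable are made up for illustration, and the in-process `queue.Queue` is only a stand-in: a real system would want a durable queue so retries survive a restart.

```python
import queue


def index_or_enqueue(doc, index_doc, retry_queue):
    """Try the synchronous index write; on any failure, queue the doc for later.

    `index_doc` is a hypothetical callable wrapping the search client.
    Returns True if the synchronous write succeeded.
    """
    try:
        index_doc(doc)
        return True
    except Exception:  # real code would catch the client's specific errors
        retry_queue.put(doc)
        return False


def drain_retries(index_doc, retry_queue):
    """Worker step: re-attempt queued docs until the queue is empty or a write fails."""
    retried = 0
    while True:
        try:
            doc = retry_queue.get_nowait()
        except queue.Empty:
            return retried
        try:
            index_doc(doc)
        except Exception:
            retry_queue.put(doc)  # keep it for the next run
            return retried
        retried += 1
```

This only works because indexing the same doc twice is harmless: a request that timed out may have succeeded on the ES side, and the retry simply overwrites it.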
In instances like that I tend to push back on the requirement, for example with this classic DB + Elasticsearch case:

1. How often is a user going to perform an update and then search for the exact same thing immediately after?

2. Suppose they did: if Elasticsearch was updated in the background, is the queue/worker fast enough that the user won't notice a latency of a second or two at most?

It really depends on what you're doing, because if Elasticsearch is operating as its own source of truth with data that the primary DB doesn't have, then yeah, you're going to have trouble keeping both strongly consistent in a transactional manner without layering on complexity (like sagas with transactions and compensations). But if it's merely a search engine on top of your source of truth (for example, you search ES to get a list of primary keys and then fetch all the data from the DB), you've got some breathing room.
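The "search for primary keys, then fetch from the DB" pattern could look something like the sketch below. `search_ids` is a hypothetical callable that wraps the search engine and returns ranked primary keys; the `records` table is likewise an assumption.

```python
import sqlite3


def search_records(conn, search_ids, query):
    """Use the search engine only for ranking; read the actual data from the DB."""
    ids = list(search_ids(query))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title FROM records WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: {"id": row[0], "title": row[1]} for row in rows}
    # Keep the search ranking and silently drop hits the DB no longer has.
    return [by_id[i] for i in ids if i in by_id]
```

With this shape, a lagging index can miss a fresh record or rank it oddly, but it can never show field values that disagree with the source of truth.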

I mean, we're talking plucky upstart here and not enterprise FAANG, so there's definitely a case for 'less is more'.

I think a different framing for the question might be more helpful. What is your overall goal? You cannot have everything. In fact, if you try to have everything, you will get nothing.

I would say that 99% of the time the implicit goal is to cut down development time. And the best way to cut development time in the long term is to cut down complexity.

To cut down complexity, we should avoid complex problems, solve them with existing solutions, or at least contain them. Sometimes the price is that you need to solve some easier problems yourself.

For example, microservice architectures promise that you need less coordination between teams, because parts of the systems can be deployed independently. The price is that you cannot use database transactions to guarantee integrity.

I think data integrity is almost always a much more important problem to solve, partly because it is so difficult to solve by yourself. Actually, it is often so difficult that most people just ignore it.

For example, if you adopt a microservices architecture, you often just ignore data integrity and call your system "eventually consistent". In practice this means that you push the data integrity problems to the sink system.

It is better to think of data integrity as a meta-feature rather than a feature. Having data integrity helps make the other features of your system simpler. For example, migrating schema changes in your system is much more manageable if you use a database which can handle the migration within a transaction.
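As a small illustration of a migration inside a transaction, here is a sketch using SQLite, which supports transactional DDL. The `migrate` helper is hypothetical, and it assumes the connection was opened with `isolation_level=None` so that Python's `sqlite3` module does not manage transactions on its own.

```python
import sqlite3


def migrate(conn, statements):
    """Apply all schema statements atomically: either every one lands or none do.

    Assumes `conn` was created with isolation_level=None (autocommit mode),
    so the explicit BEGIN/COMMIT/ROLLBACK below are the only transaction control.
    """
    conn.execute("BEGIN")
    try:
        for statement in statements:
            conn.execute(statement)
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # undo the steps that already succeeded
        raise
    conn.execute("COMMIT")
```

If the second statement of a two-step migration fails, the first one is rolled back too, so the schema is never left half-migrated.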

In your example, there are various ways the system can be left in an inconsistent state after a crash, even if the database is the "source of truth". For example, do you always reconstruct the ES cache after a crash? If not, how do you know whether it contains inconsistencies? Whose job is it to initiate the reconstruction? And so on.
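One possible way to answer the "how do you know" question is a periodic reconciliation pass that compares the index against the database. This is only a sketch under assumptions: the `records` table, the `find_drift` name, and the dict standing in for the ES index are all hypothetical, and a full scan like this would need batching on a large dataset.

```python
import sqlite3


def find_drift(conn, index):
    """Report where the index disagrees with the database.

    missing:  rows in the DB that the index does not have
    stale:    rows present in both, but with different content
    orphaned: index entries whose DB row no longer exists
    """
    db_rows = {row[0]: row[1] for row in conn.execute("SELECT id, title FROM records")}
    missing = sorted(i for i in db_rows if i not in index)
    stale = sorted(
        i for i in db_rows if i in index and index[i]["title"] != db_rows[i]
    )
    orphaned = sorted(i for i in index if i not in db_rows)
    return {"missing": missing, "stale": stale, "orphaned": orphaned}
```

It does not settle whose job the repair is, but it at least turns silent drift into something you can alert on.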