| > There are a lot of applications where your database is not the, or even a, source of truth, but effectively a big cache that you can either fully rebuild or where rebuilding isn't necessary (the data is too "fast moving" for there to be much point).
This use case doesn't mean it's okay to lose data randomly. Cache invalidation should happen intentionally via an intelligent algorithm, and other data stores (such as Redis) provide this.
> Here is another: A search engine for classifieds where the source of truth of 99%+ of listings were external feeds that'd get re-crawled at least once a day. If we lost a few million updates, who cared? If it was few enough, we'd just let the normal updates take care of it. If we had a major problem, or needed to do a major update that'd have compatibility implications, we could just do a re-crawl of all the feeds.
Just because some data loss is acceptable doesn't mean it's desirable, and while I agree that most of the time some data loss doesn't matter, I don't think you can actually say it never will. What if the 1% listing happens to include a client's feed when the client is doing a major product launch?
> You're scaling reads by replication, and so the vast majority of your data exists in multiple data stores. You may choose one you consider reliable for the source of truth, and the rest are basically caches, but you may want/need something more capable than a straight up key/value store for various reasons.
Again, cache invalidation should happen intelligently via a well-thought-out algorithm, not by randomly dropping data, and there are solutions which supply that.
> I believe the vast majority of data stores I've worked with have been ok to suffer a total loss of because most of them are not the source of truth, and the source of truth can be re-queried fast enough for it to not be a big deal to deal with losses in secondary copies.
Okay, I don't buy this, but let's say it's true. What about the memory leaks?
And what if your needs change, and you're no longer okay with data loss? Are you willing to take the risk that you're going to have to rewrite your storage layer because you chose a data store that drops data randomly, and your needs changed? |
> This use case doesn't mean it's okay to lose data randomly.
It often does, as long as the frequency is low enough. I've worked on many systems where the decision was explicitly taken to do things in ways we knew would lose data, because we could do an easy calculation of "up to x% lost on ingestion will lead to a y% average increase in updates" and determined that was worth it if it reduced the time spent on a feature or the server costs by a certain amount. It's a completely reasonable tradeoff as long as you can quantify the risk and quantify the cost. (Here's a tip for something that scarily few engineering organizations do: track not just time spent on projects, but calculate the cost per feature and report on it. It very quickly changes organisational priorities when managers see that what they thought was a minor change ended up costing $10k in engineering time.)
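To make the kind of calculation I mean concrete, here is a minimal sketch. Every number and name in it is made up for illustration, and it assumes the simplest possible model: each lost update costs exactly one extra re-processing later.

```python
def net_monthly_benefit(monthly_saving: float,
                        loss_rate: float,
                        updates_per_month: int,
                        cost_per_extra_update: float) -> float:
    """Saving from the cheaper design minus the cost of redoing lost updates.

    Assumes (hypothetically) that every lost update is simply processed
    one more time later, at a fixed cost per update.
    """
    extra_updates = loss_rate * updates_per_month
    return monthly_saving - extra_updates * cost_per_extra_update


# Hypothetical figures, not taken from any real system.
benefit = net_monthly_benefit(
    monthly_saving=5_000.0,        # server/engineering cost avoided per month
    loss_rate=0.01,                # "up to x% lost on ingestion"
    updates_per_month=10_000_000,
    cost_per_extra_update=0.001,   # cost of re-processing one update
)
print(f"net benefit per month: ${benefit:,.2f}")
```

If the result is comfortably positive under pessimistic inputs, accepting the loss is defensible; if it flips negative, it isn't.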
> and other data stores (such as Redis) provide this
I love Redis, but Redis doesn't offer the same query abilities as document stores like MongoDB. There are plenty of cases where Redis is the better choice, and there are plenty of cases where MongoDB is an option where Redis simply isn't. It depends entirely on the use case.
> Just because some data loss is acceptable doesn't mean it's desirable,
It's never desirable, but other properties of such a change may make the change desirable enough to make it worth dealing with any such data loss.
E.g. I've run platforms that kept message queues entirely in memory because the performance gain and cost reduction were more important than not losing messages in the case of a crash.
> What if the 1% listing happens to include a client's feed when the client is doing a major product launch?
For the case in question it didn't matter at all. Clients would use an XML-RPC call to indicate feed updates, and if they wanted to force re-indexing, they could, and the system adjusted "last ditch" scheduled re-retrievals based on typical update frequency anyway. Or we could force re-indexing too. In any case they had no guarantees of their data being made available within a specific time frame. In practice this was not a problem even once.
This is not hypothetical - we had cases where we restarted servers running the in-memory message queues mentioned above, purging millions of messages without issue, as the system was designed from the outset with the explicit intention that losing data in most places was perfectly ok.
> Again, cache invalidation should happen intelligently via a well-thought-out algorithm, not by randomly dropping data, and there are solutions which supply that.
Do you think we just randomly say "oh, let's just lose data for fun"? In each instance where I've designed systems like this, it has been a very deliberate choice based on determining:
1) What the cost and risks of losing data would be. In the above-mentioned systems, data was refreshed so regularly, with the ability to do it faster, that the cost was deemed to be extremely low, with virtually no risks (nobody had any guarantees of getting their data into our system).
2) What the costs of avoiding/mitigating the risk of a loss would be. In these cases, deciding to accept the risk because of the low cost meant that we could opt for less redundancy and fewer servers (e.g. the in-memory queue processing cost us 1/10th as much in server resources as if we hit disk).
3) Other factors. The "cache invalidation" in the case above was the reason for the frequent re-retrieval of feeds. We had no way of knowing whether or not a feed had changed without trying to retrieve it, and we could not depend on the sender supporting ETags etc. (it was cheaper to just assume everything would be broken and design for it than to invest resources in providing support to fix these things), so we had to be prepared to retrieve the whole feeds and compare against the database anyway. We could optimise that by keeping hashes etc., but in practice we found this gave minimal benefits compared to the massive cost reductions we got from being able to reduce redundancy across the entire indexing chain.
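For anyone wondering what "keeping hashes" would look like, a rough sketch follows. This is not the code we ran; `item_hash`, `diff_feed` and the dict-based "database" are hypothetical stand-ins, and it assumes each feed item has a stable id.

```python
import hashlib


def item_hash(body: str) -> str:
    """Content hash of one feed item."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def diff_feed(feed_items: dict[str, str],
              stored_hashes: dict[str, str]) -> tuple[dict[str, str], set[str]]:
    """Compare a freshly retrieved feed against previously stored hashes.

    Returns (changed, removed): `changed` maps item id -> new hash for
    items that are new or whose content differs, `removed` holds ids
    that no longer appear in the feed.
    """
    changed = {}
    for item_id, body in feed_items.items():
        digest = item_hash(body)
        if stored_hashes.get(item_id) != digest:
            changed[item_id] = digest
    removed = set(stored_hashes) - set(feed_items)
    return changed, removed


# Hypothetical data: "a" is unchanged, "b" was edited, "c" is new.
stored = {"a": item_hash("old a"), "b": item_hash("old b")}
changed, removed = diff_feed({"a": "old a", "b": "new b", "c": "new c"}, stored)
```

Note that you still have to download and parse the whole feed to compute the hashes, which is part of why the saving was small for us.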
And the fact is that servers crash, disks die, and errors in any component can corrupt data. You can choose to spend a fortune trying to prevent that through additional redundancy etc., and you should if your data is valuable and a source of truth. But when it isn't, it is a perfectly valid strategy to simply accept that things will fail and design the system to self-heal.
> What about the memory leaks?
As above: Servers crash. So if your system needs to be available, you need corresponding redundancy. The typical additional scaling cost of dealing with restarts caused by resource constraints (accounting for a slightly increased risk of simultaneous restarts, which requires a small amount of extra capacity) is often perfectly ok. I tend to design all systems I work on to allow as many components as possible to be restarted at will, because you need to be able to accommodate upgrades etc. anyway. If you do that first, then allowing automatic restarts for the failure types where that is acceptable tends to be trivial.
If crashes or forced restarts pose a risk to your system, then your overall architecture has a substantial unmitigated risk. If they don't pose a risk, then memory leaks (unless extreme and rapid/frequent) are rarely an issue.
> And what if your needs change, and you're no longer okay with data loss? Are you willing to take the risk that you're going to have to rewrite your storage layer because you chose a data store that drops data randomly, and your needs changed?
Then you have to pay some of the cost you saved by making those decisions in the first place. It's a calculated risk. Firstly, it is rare. Secondly, it depends on your type of business - in a startup, for example, future cash is on average worth far less than current cash, and cutting server costs and complexity substantially now is more important than avoiding additional engineering costs a year or two down the road.
E.g. I recently billed a client $30k more on a project over ~18 months because they wanted to move the system between hosting providers a couple of times to make use of ~$30k of free credits. So overall it looks like they gained nothing. But they did gain something: they deferred most of that cost by more than a year, during which time they did two funding rounds, and the cash they finally paid the remainder with cost them far less.
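One way to put a number on "cost them far less" is a plain present-value discount. This is a sketch under assumptions: the 50% annual rate is a made-up stand-in for how expensive early cash is to a startup, and it treats the whole amount as paid in one lump after 18 months, which is a simplification.

```python
def present_value(amount: float, annual_rate: float, years: float) -> float:
    """Value today of `amount` paid `years` from now at a given discount rate."""
    return amount / (1.0 + annual_rate) ** years


# Hypothetical: $30k paid 1.5 years later, discounted at 50% per year.
pv = present_value(30_000.0, annual_rate=0.5, years=1.5)
print(f"$30,000 paid in 18 months is worth about ${pv:,.0f} in today's cash")
```

The higher the rate you assign to today's cash, the more attractive deferring a cost becomes, even when the nominal totals are identical.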
You have to account for this when considering the potential costs associated with risk of change. When you do, which decisions make sense often changes dramatically.
And you have to account for risk of change for all technical decisions anyway - this applies just as well if you e.g. pick an RDBMS instead and your needs require you to add on something that will scale more cheaply down the line.
So the answer is: Yes, I am often prepared to take that risk when the financials support it. You need to understand your risks and understand your costs, and understand how the value of your cash on hand is expected to change.
All of this basically boils down not so much to technical choices as to project management, risk management and accounting.
This, to me, is also one of the areas that distinguishes software engineering from "development". Managing project cost and projected returns and risks are all part of sound software engineering, but it's something far too many organisations don't spend time on at all, which is terrifying.