by lmm 1893 days ago

> I was careful to choose popular, and not project opinions about SQL/NoSQL/etc. In my field, most of our data is relational and we use NoSQL for caching, queues, shared work, ETL performance, dashboards, etc. but at the end of the day for persistence, the RDBMS is where the “gold copy” data ends up.

I'd worry about using an RDBMS in that situation because it's fundamentally mutability-first. I prefer to regard the user's actions as the "gold copy" and the current state of the world as a transient, derived thing (i.e. event sourcing), but that doesn't really play to the strengths of an RDBMS. You also have to make global decisions about transactionality (in particular, you can't easily commit a data write without also committing the updates to all your secondary indices), and the much-vaunted relational integrity can be a problem because you can only represent constraints for cases where the appropriate response to a constraint violation is dropping the write on the floor. And of course you can't safely allow the ad-hoc querying that SQL is designed for against a live datastore.
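To make the event-sourcing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration (the `Event` schema, the account names, the `replay` and `overdrawn` helpers); it is not taken from any particular system. The append-only log is treated as the "gold copy", the balances are recomputed from it, and a rule violation (an overdraft) is kept in the log and surfaced in a derived view rather than rejected at write time.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """An immutable record of something a user did (hypothetical schema)."""
    account: str
    kind: str    # "deposit" or "withdraw"
    amount: int


def apply(balances: dict[str, int], event: Event) -> dict[str, int]:
    """Return a new state with one event applied; the input is not mutated."""
    current = balances.get(event.account, 0)
    delta = event.amount if event.kind == "deposit" else -event.amount
    return {**balances, event.account: current + delta}


def replay(log: Iterable[Event]) -> dict[str, int]:
    """Derive the current state of the world by folding over the log."""
    state: dict[str, int] = {}
    for event in log:
        state = apply(state, event)
    return state


def overdrawn(balances: dict[str, int]) -> list[str]:
    """Accounts that broke the 'no negative balance' rule, found after the fact."""
    return sorted(account for account, total in balances.items() if total < 0)


# The log is the source of truth; nothing in it is ever updated or deleted.
log = [
    Event("alice", "deposit", 100),
    Event("alice", "withdraw", 30),
    Event("bob", "deposit", 50),
    Event("bob", "withdraw", 80),  # breaks the rule, but is still recorded
]

balances = replay(log)          # disposable, can be rebuilt at any time
violations = overdrawn(balances)
```

Because the state is only ever derived, replaying a prefix of the log (for example `replay(log[:2])`) gives the state as of an earlier point in time, which is the history that a mutate-in-place table discards.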

I do think a traditional RDBMS makes some sense at the end of an ETL pipeline - where the secondary indices can be a big help for the ad-hoc querying/aggregation that you want to do in a reporting environment. But transactions don't make sense in that environment because it's essentially read-only (or at least single-writer), so you're still paying for a lot you're not using. I wouldn't use JPA for this, but I wouldn't really write code for this kind of environment at all - the point is to expose the data in a structured form for non-code tools.
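As a rough sketch of that end-of-pipeline role, the snippet below uses Python's standard `sqlite3` module as a stand-in for whatever reporting database you would actually use. The `orders` table, its columns, the index name and the sample rows are all made up for illustration. A single writer bulk-loads the rows once, and the secondary index then serves ad-hoc filtering and aggregation.

```python
import sqlite3

# In-memory database standing in for a reporting store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, total INTEGER)")

# Secondary index to support ad-hoc lookups by region.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")

# Single-writer bulk load, as the last step of a hypothetical ETL job.
rows = [
    ("eu", "widget", 120),
    ("eu", "gadget", 80),
    ("us", "widget", 200),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
conn.commit()

# An ad-hoc aggregation of the kind a reporting tool might issue.
report = conn.execute(
    "SELECT region, SUM(total) FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Ask SQLite how it would run a filtered lookup; the plan text is in the
# last column of each returned row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE region = ?", ("eu",)
).fetchall()
```

In practice the queries would come from a BI or reporting tool rather than from code, which is the point made above; the script only shows what the index buys you.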

Essentially I find mature systems outgrow SQL databases. The case where an RDBMS actually fits is the early stages, where you want to run ad-hoc reports against your live datastore, you want to keep the current state of the world rather than worrying about history, having to manually fail over to a replica if the master goes down is OK, updating all your indices synchronously is fine because write performance isn't an issue yet, and you can put constraints in the database because blowing up with an error page is an adequate response when the user breaks the business rules. Using JPA increases the rate at which you can iterate on the system, which is the priority for that kind of use case.