by mikekchar 2686 days ago
When you want to do validation depends on when you can do something about it. I use a NoSQL DB at work, and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. They get the data wrong (or are using older versions of specs, etc.) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.
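
A rough sketch of that "store first, ask what's wrong later" idea, in Python. Everything here is made up for illustration (the in-memory `raw_store`, the `account` and `amount` fields, the rules in `problems`); it is not how our system works, just the shape of the approach.

```python
import json
from decimal import Decimal, InvalidOperation
from typing import List, Tuple

# Hypothetical raw store: every payload is kept exactly as it was received.
raw_store: List[str] = []


def ingest(payload: str) -> None:
    """Store the payload as-is; nothing is rejected at write time."""
    raw_store.append(payload)


def problems(payload: str) -> List[str]:
    """Return the validation problems for one payload (empty list = valid)."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    found = []
    if not doc.get("account"):
        found.append("missing account")
    try:
        Decimal(str(doc.get("amount")))
    except InvalidOperation:
        found.append("amount is not a number")
    return found


def incorrect_records() -> List[Tuple[str, List[str]]]:
    """'Show me all the data which is incorrect', run after the fact."""
    report = []
    for payload in raw_store:
        found = problems(payload)
        if found:
            report.append((payload, found))
    return report
```

The point is that `ingest` never fails, so nothing sent to you is lost, and the validation rules can change later without touching what was stored.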

I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.

2 comments

It sounds more like an edge case though. I can't imagine all the data you need to store may or may not be the right format, so I wouldn't switch my database just because one or two entities need this.

Anyway this is 2019 so PostgreSQL JSONB fields have got you covered. You can even efficiently query the JSON objects within them.

I'll give a qualified "yes" to that. I agree there is no particular reason you can't use PostgreSQL. There are some advantages to the designs of some No SQL DBs if it fits your use case (immutable data, the ability to replicate easily). For our application eventual consistency was a really good fit. Also we wrote it 10 years ago :-) Even still, we often muse about replacing what we're using with PostgreSQL.

The main reason I wanted to reply to the question was that sometimes I see people who just can't get past not enforcing a schema at the DB layer for your whole data model. It really is crucial to understand that doing so means that bad data doesn't end up in your DB. This isn't always what you want. Like I said, not super common, but not unheard of either.

The underlying technology is pretty unimportant as long as you can do what you need to do. I've historically never really been a No SQL DB fan (there are very few downsides to relational data!!!). However, we've been using CouchDB for the odd thing and IMHO it has its place. Interestingly, I think it was my boss who originally selected it and he's gone very cold in that direction, where I've warmed to it while using it. I think the main thing is to understand exactly what benefit it is giving you (in our case easily replicated data with immutable change sets) and not give in to the hype of "OMG! You don't need a schema!", which is just not true. I've never asked him, but it is possible that my boss thought it would make life easier not to have to deal with schemas and DB migrations, and when it actually made things harder he got upset. I came into it knowing these things, but not really understanding the other benefits, which is why I warmed up to it.

If we were to start again, I think we would almost certainly go the PostgreSQL route, but I can see places where we would have some problems. It's probably a wash, really -- which is why we've not seriously tried to move away from CouchDB.

Unlogged JSONB tables in Postgres have generally made NoSQL systems look pretty bad. I'm really happy the industry finally came up with Vitess so we could have a middle option between "My ACID database needs to scale writes so I'll roll my own fragile sharding layer" and "give up all attempts at schema and consistency and transactions".

Vitess is a really comfortable middle ground of fairly familiar database semantics within a partition.

Same with Citus for Postgres. Or CockroachDB / TiDB for a rebuilt natively-distributed modern RDBMS.

The default design pattern for storing potentially invalid data with an RDBMS is to load the data (usually in bulk) into tables without constraints (loading tables), then do the validation in the database, and move the valid records to their final tables.
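
A minimal sketch of that loading-table pattern, using Python's built-in `sqlite3` as a stand-in for whatever RDBMS you actually run. The table names (`payments_loading`, `payments`), the columns and the deliberately crude validity rule are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Loading table: every column is TEXT and nothing is constrained.
    CREATE TABLE payments_loading (account TEXT, amount TEXT);
    -- Final table: constraints reject anything malformed.
    CREATE TABLE payments (
        account TEXT NOT NULL CHECK (account <> ''),
        amount  REAL NOT NULL CHECK (amount >= 0)
    );
""")

# Validation expressed in SQL so it runs inside the database.
# The amount check is crude on purpose: starts with a digit and
# contains nothing except digits and dots.
VALID = """
    account IS NOT NULL AND account <> ''
    AND amount IS NOT NULL
    AND amount GLOB '[0-9]*' AND amount NOT GLOB '*[^0-9.]*'
"""

rows = [("A-1", "12.50"), ("", "3.00"), ("A-2", "twelve"), ("A-3", None)]

with conn:  # one transaction: load, copy valid rows, remove them from loading
    conn.executemany("INSERT INTO payments_loading VALUES (?, ?)", rows)
    conn.execute(
        "INSERT INTO payments (account, amount) "
        "SELECT account, CAST(amount AS REAL) "
        f"FROM payments_loading WHERE {VALID}"
    )
    conn.execute(f"DELETE FROM payments_loading WHERE {VALID}")

# Whatever is still in the loading table is the data that needs sorting out.
rejected = conn.execute(
    "SELECT account, amount FROM payments_loading ORDER BY rowid"
).fetchall()
```

Rows that fail validation simply stay in the loading table, so nothing is thrown away and they can be fixed by hand and re-run through the same step.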