Hacker News new | ask | show | jobs
by bpye 2686 days ago
I'm tired and haven't often dealt with database systems. I'm struggling to see significant benefits for schema on read style systems - maybe progressive migration? I'm not convinced...
5 comments

When you want to do validation depends on when you can do something about it. I work with a NO-SQL DB at work and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. The get the data wrong (or are using older versions of specs, etc) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.

I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.

It sounds more like an edge case though. I can't imagine all the data you need to store may or may not be the right format, so I wouldn't switch my database just because one or two entities need this.

Anyway this is 2019 so PostgreSQL JSONB fields have got you covered. You can even efficiently query the JSON objects within them.

I'll give a qualified "yes" to that. I agree there is no particular reason you can't use PostgreSQL. There are some advantages to the designs of some No SQL DBs if it fits your use case (immutable data, the ability to replicate easily). For our application eventual consistency was a really good fit. Also we wrote it 10 years ago :-) Even still, we often muse about replacing what we're using with PostgreSQL.

The main reason I wanted to reply to the question was that sometimes I see people who just can't get past not enforcing a schema at the DB layer for your whole data model. It really is crucial to understand that doing so means that bad data doesn't end up in your DB. This isn't always what you want. Like I said, not super common, but not unheard of either.

The underlying technology is pretty unimportant as long as you can do what you need to do. I've historically never really been a No SQL DB fan (there are very few downsides to relational data!!!) However, we've been using CouchDB for the odd thing and IMHO it has its place. Interestingly, I think it was my boss who originally selected it and he's gone very cold in that direction, where I've warmed to it while using it. I think the main thing is to understand exactly what benefit it is giving you (in our case easily replicated data with immutable change sets) and not give in to the hype of "OMG! You don't need a schema!", which is just not true. I've never asked him, but it is possible that my boss thought it would make life easier not to have to deal with schemas and DB migrations, and when it actually made things harder he got upset. I came into it knowing these things, but not really understanding the other benefits, which is why I warmed up to it.

If we were to start again, I think we would almost certainly go the PostgreSQL route, but I can see places where we would have some problems. It's probably a wash, really -- which is why we've not seriously tried to move away from CouchDB.

Unlogged jsonb tables in postgres have generally made nosql systems look pretty bad. I'm really happy the industry finally came up with vitesse so we could have a middle option between "My ACID database needs to scale writes so I'll roll my own fragile sharding layer" and "give up all attempts at schema and consistency and transactions".

Vitess is a really comfortable middle ground of fairly familiar database semantics within a partition.

Same with Citus for Postgres. Or CockroachDB / TiDB for a rebuilt natively-distributed modern RDBMS.
The default design pattern for storing potentially invalid data with RDBMS is to (usually bulk) load the data in tables without constraints (loading tables), then do the validation in the database, and move valid records to their final tables.
I do a lot of work with both traditional RDBMSes and NoSQL databases.

The main question I would ask is: Is your data schemaless? Often it is - especially when storing what we'd normally call a "document". Heavily polymorphic data is often better stored schemaless. And sometimes you don't necessarily have the schema in advance (common when storing "other people's JSON").

You can store schemaless data in Postgres via the JSONB type, so this isn't necessarily a "Mongo vs Postgres" issue, but more of a general data modeling issue.

As a point of reference, the folks that struggle with schemaless tend to be the ones using Javascript, Ruby, or other type-ambiguous languages. Schemaless is less of a problem in Java and other languages where class structures enforce your schema.

Not having to know all / as many of the structural details up-front could be of value in some use-cases. It can translate to reducing time-to-start cutting code, which can (in some cases) be a business priority, and can lead to identifying critical dependency problems earlier in development.

I'd happily agree that's an inappropriate model in close to 99% of cases, and that even if it was the right model one could (and most likely should) still use a decent database for this anyway.

I can't speak to document stores very well, but one spot where schema-on-read makes sense is in data warehousing type applications. One of the potential troubles with the traditional ETL approach is that transforming the data to fit a fixed schema almost always involves some information loss that might make the data less suitable for answering certain questions.

That's fine if you can predict what questions your business intelligence or data science team will be asked ahead of time, but, realistically, you can't actually do that. Using a schema-on-read data warehouse instead is a more costly option, but also leaves you more able to respond to changing business demands.

One pretty cool use of schema on read is Splunk. It wants to take in all the data and let you search, transform and visualize it in a variety of ways some of which you may not know until you start exploring what data you have.