by toomuchtodo 2686 days ago
I’ll allow it. Years later, I’m still miffed at Graylog (centralized logging engine) for having required MongoDB for a small bit of auth and meta storage that could’ve easily been done in MySQL or PostgreSQL (RDS even), forcing the need for that much more ops work for a small Mongo cluster for HA. Everyone deprecating the use of Mongo is a welcome turn of events.

I shall recall these dark days to the next generation as “NoSQL Madness”, or more colloquially, “my schema is my app layer”.


I once worked at a place that, at some point in the past, had developed some kind of semantic graph database that never gained market traction.

The author of it had ended up as CTO and kept seeking out uses for his work and ended up finding all kinds of odd places for it to live, including as the auth database in a large scale document analytics system and running part of the payroll.

We were constantly running into all kinds of scalability issues where this 12 year old component was often the source of our pain, but he'd never even entertain a conversation to eliminate it and consolidate or replace it with something better.

I worked for a company that was old but had a fancy new product. It was just about to be released when....

A competitor bought us for stupid money. They fired most folks (not my small department) and had their CTO decide between our fancy new product and the product he created. There was no question: our product was amazing, while his Frankenstein was two pieces of equipment cabled together in three times the footprint ... and still did less, and was quite a ways away from being "ready".

If there was a silver lining, the Frankenstein product doomed that company and we got bought by the far more competent competitor they had.

Happy ending.
It's surprising how that works. I've been through numerous mergers and acquisitions, and I've been in the company doing the acquiring.

Lots of hair pulling, and I can say that in every case it was really hard to predict the outcome for any individual, positive or negative, until long afterward.

This is the definition of software engineering hubris distilled into a few paragraphs
I think I know the company. We were almost acquired by them until they hit the rocks and we took a hit. Interesting to hear how that database was a pain to manage.
There aren't many of these databases around... Which database was that? Is it still around?
I agree with the general sentiment; I just wanted to point out that "my schema is my app layer" has some valid use-cases. At my current job we deal with highly complex schemas (modelling insurance contracts). Correctness is paramount, but at the same time you need flexibility in terms of change over time. Defining these schemas at the DB level would be painful, a bit like writing web apps in assembly. Languages like Haskell (which we use) help here with their rich and expressive typing capabilities, so we can model these complex domains expressively and use the DB only for persistence. Admittedly it does have downsides, like having to write your own migration layer, but for this use-case the benefits outweigh the pains.

PS.: We do use Postgres though ;) as it has 1st class json support and at the same time we have the luxury of using its relational capabilities where needed (think of a hybrid model).

My pet theory is that NoSQL took off purely because people were sick of having to manage schema changes.

Unfortunately, people reacted to being (justifiably) frustrated with schemas by throwing strict schemas out entirely, instead of making better schema management/migration tools.

Also, DBAs hate developers. Developers want to make changes to the database to support their classes, such as "I need to add a column", and the DBA response is "no" or, even worse, "fill out this ticket and it will get prioritized in the next scrum". Meanwhile the developer is at a standstill.

I interviewed at Southwest Airlines years ago and I don't remember how it came up, but we were talking about bottlenecks or something and I brought up the fact that having to go to a DBA to get a column added to a table, no matter how trivial, is a great source of delay. The whole room just nodded and looked at the floor; it was obviously painful for them.

NoSQL took the DBA out of the loop, now the developers were in full control of what was persisted and what wasn't. If they needed a new field they just made it so. On the flip side, DBAs got really freaked out and cried to whoever would listen.

In my experience you either have a DBA report to Developers or Developers report to a DBA. Never give them equal footing (even implied) because they'll just fight.

I think you also need to look at it from the DBA's standpoint. If they do whatever the developers want and the system goes down or, more likely, other parts become slow, it is the DBA who gets the call.

In a large company like SW, the developer requesting some change for their app may have no idea how else the db is being used. What if their requested changes took down the db and prevented reservations from working?

My examples are extreme, but I have seen similar things in my years as both a developer and a DBA at times.

Took me a while to get back here but I do understand your point and it's totally valid. That door swings both ways.

That's why it's hard for the two camps to work side by side.

NoSQL gave power to the devs at the expense of the experience and wisdom of the database folks. I bet many, many applications and systems were completely screwed datawise more than once because of devs and NoSQL.

To be clear, I've known DBAs who act exactly like you described in your original comment. Very annoying.

The best I've seen it work is to have a DBA on the team building the application.

Tell that to AWS. They've banned relational databases for specific workloads because Dynamo (nosql) provides more consistent performance, and is easier to operate.

Tons of conflation of Mongo's problems with those of nosql in this thread.

Did a project with Dynamo last year. Hope never to see it again.

Compared with RDBMS tooling, it looks like a high school project.

DynamoDB has had major improvements in the last few months: e.g. you get dynamic capacity provisioned tables (avoids read/write capacity exceeded exceptions because of capacity planning uncertainty), and transactions, to name two. However, even if you have a hosted RDBMS it has an implicit read and write capacity throughput that you need to design for (e.g. hotspots in partitions), you just hit it a bit later in your project. The bounded latency at scale (throughput, and size of tables) is the main win for DynamoDB.
It couldn't possibly be because they're trying to push use of their own technology to force dogfooding.
That might make sense if they didn't also offer a plethora of their own in-house-built as well as managed OSS relational DBs.
None of those OSS relational DBs offer Amazon lock-in the way DynamoDB does - it's more reliable income if someone uses it, but it also takes more convincing for people to use it. What enterprise would use it if Amazon themselves don't?
This argument makes no sense at all. What does lock in have to do with Amazon dog fooding its services? They're... trying to lock themselves in? What?
Your average dev is not making decisions based on what might work best for one particular problem Amazon has.
Do you think all of the companies that chose Cassandra and Dynamo were wrong to do so? There's no use case for NoSQL? There were no lessons learned, value adds from NoSQL?

How do you explain the 'NewSQL' approach, which seems to be so clearly borne of what we've learned from NoSQL?

It should be obvious that NoSQL has value, regardless of the issues with one of the earlier NoSQL DBs.

I don't see a value other than fashion driven development, especially when comparing the bare bones browser GUI for Dynamo with something like SQL Server Management Studio, or that whole story with primary and secondary indexes, with prices being set by index usage.
The Cassandra design was always a bit of a frankenstein without clear upside to me, but the nosql craze started great conversations.

There is certainly merit beyond fashion to the dynamo architecture, and there are workloads where (for example) HBase is simply the correct type of tool despite the lack of polish of its management interface

I think it also has to do with the source of data. If you receive data from a third party it’s easy to insert the whole document and figure out what parts you need later. If your data comes from your own client interface it makes more sense to build up the data model over time.
You could just plonk the data in a JSONB, BLOB or just plain old file on a disk with a URL pointing to it while you figure it out. And not introduce another super complex to support dependency...
Schema management and automated migration generation frameworks alleviate a lot of that headache. As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well. You don't even need to use Django for the application, you can use it purely to define schemas and perform migrations on the DB. I'm sure there are alternatives for other languages.

People who got tired of dealing with schemas are now realizing that having zero schema is way more of a headache and way more work than the up front work of creating the schema.

> As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well.

In my experience Alembic works better.

Well, happily, those days are gone for good. Who needs NoSQL when you have the blockchain!
Glorious, cursed comment.
Setting up graylog was one of the worst mistakes I made. It took forever to get all the required software installed and configured and then it was taking up all the ram on the server doing fuck all.
There were single script 1-click installs for it in bash all over for me..but yeah jvm is a hog.
I think the "jvm is a hog" misattribution is actually part of the cause of "jvm being a hog". It implies that you don't need to worry about your memory management and efficiency in Java, because any hoggyness is the jvm's fault. With GC, you can just get away with having memory leaks everywhere and allocate millions of objects per second without any catastrophic consequences. JVM software written with knowledge of memory management, and of the fact that allocation isn't free, can perform just as well as any other platform. As Bryan Cantrill loves to say in every single talk: "gc is not your problem, allocation is your problem, GC just defers the cost".
My experience with those one-liner installs is that they usually work... strictly speaking. They don't scale, they don't deal with edge cases, they know nothing of your environment. They install one piece of software (in an "interesting" way that won't upgrade), and that's it.
That was just Java doing Java things. Business as usual.
This is a nonsense comment doing language troll things.
Can you explain in more detail about “my schema is my app layer”?

EDIT: fixing autocorrect

MongoDB by default is (was? It’s been a while since I’ve used it) schemaless, which means all of your data validation must take place in your app instead of the database. Your data integrity is then only as good as your weakest validation.

Edit: scheme/schema autocorrect typos corrected. Thanks!

I've always preferred the terms "schema on write" and "schema on read" to schemaful/schemaless.

At some point, you are always going to have to get the data into some sort of consistent model, so that you can operate on it in a predictable and sane way. So there's no question of there being a schema, even if it's only implicit. The question is, do you apply the schema once, when you write to the data store, so that the data at rest is consistently structured? Or do you allow it to be inconsistent in the storage layer, and instead apply the schema and re-validate the data every time you read from it?

There are valid reasons why one might choose either approach.

Which is not to say that valid reasons always play in to the decision to choose one approach or the other.
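To make the distinction concrete, here is a minimal Python sketch of the two approaches. Everything in it is made up for illustration: the `REQUIRED` field list, the two store classes and their method names are assumptions, not any real database's API.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED = {"id": int, "amount": float}


def conforms(record):
    """Return True when every required field is present with the expected type."""
    return all(isinstance(record.get(key), kind) for key, kind in REQUIRED.items())


class SchemaOnWriteStore:
    """Applies the schema once, at write time, so data at rest is consistent."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        if not conforms(record):
            raise ValueError(f"rejected at write time: {record!r}")
        self._rows.append(record)

    def read_all(self):
        # Nothing to re-check here: only conforming records were ever stored.
        return list(self._rows)


class SchemaOnReadStore:
    """Stores whatever it is given; the schema is applied on every read."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        self._rows.append(record)

    def read_valid(self):
        return [row for row in self._rows if conforms(row)]

    def read_invalid(self):
        # The "show me all the data which is incorrect" query.
        return [row for row in self._rows if not conforms(row)]
```

The trade-off shows up in where the cost lands: the first store pays once and can lose input it refuses, the second keeps everything and pays on each read.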

I'm tired and haven't often dealt with database systems. I'm struggling to see significant benefits for schema on read style systems - maybe progressive migration? I'm not convinced...
When you want to do validation depends on when you can do something about it. I work with a NoSQL DB at work and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. They get the data wrong (or are using older versions of specs, etc.) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.

I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.

It sounds more like an edge case though. I can't imagine all the data you need to store may or may not be the right format, so I wouldn't switch my database just because one or two entities need this.

Anyway this is 2019 so PostgreSQL JSONB fields have got you covered. You can even efficiently query the JSON objects within them.

The default design pattern for storing potentially invalid data with RDBMS is to (usually bulk) load the data in tables without constraints (loading tables), then do the validation in the database, and move valid records to their final tables.
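A rough sketch of that loading-table pattern, using Python's built-in `sqlite3` as a stand-in for whatever RDBMS you actually run. The table names, columns and the validity rule are all invented for the example; a real pipeline would have its own.

```python
import sqlite3

# Hypothetical rule: account present, amount numeric and positive.
VALID = "account IS NOT NULL AND typeof(amount) IN ('integer', 'real') AND amount > 0"


def load_and_validate(raw_rows):
    """Bulk-load raw rows without constraints, then promote the valid ones."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Loading table: untyped columns, no constraints, stores input as given.
        CREATE TABLE payments_loading (account, amount);
        -- Final table: constraints enforced.
        CREATE TABLE payments (
            account TEXT NOT NULL,
            amount  REAL NOT NULL CHECK (amount > 0)
        );
        """
    )
    conn.executemany("INSERT INTO payments_loading VALUES (?, ?)", raw_rows)
    with conn:  # validate and move inside one transaction
        conn.execute(
            f"INSERT INTO payments SELECT account, amount FROM payments_loading WHERE {VALID}"
        )
        conn.execute(f"DELETE FROM payments_loading WHERE {VALID}")
    accepted = conn.execute(
        "SELECT account, amount FROM payments ORDER BY account"
    ).fetchall()
    rejected = conn.execute(
        "SELECT account, amount FROM payments_loading ORDER BY rowid"
    ).fetchall()
    conn.close()
    return accepted, rejected
```

Whatever stays behind in the loading table is the flawed input, still stored exactly as received and ready to be sorted out later.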
I do a lot of work with both traditional RDBMSes and NoSQL databases.

The main question I would ask is: Is your data schemaless? Often it is - especially when storing what we'd normally call a "document". Heavily polymorphic data is often better stored schemaless. And sometimes you don't necessarily have the schema in advance (common when storing "other people's JSON").

You can store schemaless data in Postgres via the JSONB type, so this isn't necessarily a "Mongo vs Postgres" issue, but more of a general data modeling issue.

As a point of reference, the folks that struggle with schemaless tend to be the ones using Javascript, Ruby, or other type-ambiguous languages. Schemaless is less of a problem in Java and other languages where class structures enforce your schema.

Not having to know all / as many of the structural details up-front could be of value in some use-cases. It can translate to reducing time-to-start cutting code, which can (in some cases) be a business priority, and can lead to identifying critical dependency problems earlier in development.

I'd happily agree that's an inappropriate model in close to 99% of cases, and that even if it was the right model one could (and most likely should) still use a decent database for this anyway.

I can't speak to document stores very well, but one spot where schema-on-read makes sense is in data warehousing type applications. One of the potential troubles with the traditional ETL approach is that transforming the data to fit a fixed schema almost always involves some information loss that might make the data less suitable for answering certain questions.

That's fine if you can predict what questions your business intelligence or data science team will be asked ahead of time, but, realistically, you can't actually do that. Using a schema-on-read data warehouse instead is a more costly option, but also leaves you more able to respond to changing business demands.

One pretty cool use of schema on read is Splunk. It wants to take in all the data and let you search, transform and visualize it in a variety of ways some of which you may not know until you start exploring what data you have.
Excellent perspective. I'll start using this at work!
Right, but most people using it use model or data repository patterns to ensure correctness. It does offer nearly infinite flexibility provided you use it correctly. You can add fields without any sort of DB work, you just start adding fields to rows as needed and let it catch up organically.

There are use cases where mongo makes a lot of sense. It's very popular in the node.js / RAD world for sure. I certainly have never been a huge fan by any means. Only relatively recently did they solve distributed writes.

Unfortunately, I’d argue PostgreSQL gets you all the same benefits with JSON storage (fairly equivalent to Mongo docs), while also giving you all the goodness of a relational, transactional, schema enforcing RDBMS. PGSQL became Mongo faster than Mongo could become PGSQL.
This is the same thing that happened in Java.

Other languages started prototyping features... that eventually just end up being implemented in Java.

Java did it way too slow, and that is a significant contributor to it being relegated to "legacy" in many areas. If it waited for the other languages to prototype stuff, it might have not been the case. The problem is that it waited for them to prototype it, refine it, release it, popularize it, and for their community to adopt it, before even starting to work on it in Java - which means that by the time they had it, most people who needed it were already elsewhere (not necessarily off JVM, just another language).

Lambdas were a very good example - if you look at the closest competitor, C#, it got the first take on them back in 2005. Then a major refinement in 2008, adding type inference. By 2010, lambdas were idiomatic in C#. Java, in contrast, released the first version in 2014. And even then, they're still less powerful.

A lot of useful features don't end up being implemented in Java and even if they do, libraries and frameworks have to update to take advantage of them. With a new language libraries are built from the ground up based on the new features.

For example adding async to the language isn't necessarily going to change your programs to be async because every widely used library has already adopted threads and doesn't support async yet and often never will.

Pretty much all of the things you'd want in a relational database are now present in Mongodb too.

The real benefit of MongoDB at this point is the ability to easily scale beyond a single machine with shards and high availability using replica sets.

Postgres will get you pretty far, but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution. At that point all the schema update and backups become a nightmare.

Schemas: https://docs.mongodb.com/manual/core/schema-validation/

Transactions: https://docs.mongodb.com/manual/core/transactions/

SQL: https://docs.mongodb.com/bi-connector/master/

> Pretty much all of the things you'd want in a relational database are now present in Mongodb too.

On the "relational" front, it has denormalized-only schema validation and limits transactions to replica sets (so no transactions with sharding), while also recommending single-document transactions over multi-document transaction via denormalization. (FWIW, transactions aren't available in any open source release).

On the "database" front, it has a history of misleading users and remorselessly dropping data.

> Postgres will get you pretty far

Postgres is an actual relational database, open source, battle-proven with a good design and a great team behind it. It never claimed to be good at, let alone capable of, doing things it could not actually do (well, or at all).

> but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution.

Scaling IS a hard thing, and presents itself quite differently to different use-cases. Nevertheless, horizontally scaling Postgres — for when one truly hits the petabyte-scale — is a problem that has been solved correctly many times before (out of core). For a similar-to-MongoDB method, check out Citus, with the assurance that it only adds to an actual database; as opposed to the MongoDB way of doing things: make up and promise magic scaling solutions that "Just Work", then try to build a database on top of it.

Most databases will never need this kind of scaling.
RDBMS only provides a limited degree of 'validation'. It still must exist fairly comprehensively in the app.
On the contrary, RDBMS provides far more opportunity for validation, because it has all the data at its disposal, which can be queried as needed without the expense of crossing the boundary.
What is the purpose of validation would you say with modern computers? At one time, specifying the exact number of chars was good for squeezing out as much storage as possible, but less so today.
It's still absolutely critical for almost everything. Some use cases:

* My code depends on this value always existing, so make this not null
* My code is doing math on this value, so make sure it is always a number
* This record belongs to another record, so make sure the other record cannot be deleted while this one still exists

Modern computers change next to nothing about the need to validate data. The world's fastest computer won't tell you how to add a number that doesn't exist.

Validation is almost always a function of business logic, not 'storage compression'.
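For what it's worth, those three use cases map directly onto NOT NULL, CHECK and FOREIGN KEY constraints. Below is a small sketch with Python's built-in `sqlite3`; the `customers`/`orders` tables and the `accepted` helper are hypothetical, and other databases spell some of this differently (SQLite, for one, only enforces foreign keys after the PRAGMA shown).

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this is switched on per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL                       -- value must always exist
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),  -- parent must exist
            total       REAL NOT NULL CHECK (typeof(total) = 'real')  -- must be a number
        );
        """
    )
    return conn


def accepted(conn, sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, inserting a customer with a NULL name, an order whose total is the string 'lots', or an order for a customer that does not exist is rejected, and so is deleting a customer that still has orders.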
First, most 'noSQL' DB's (including Mongo) have data validations anyhow, rendering the discussion almost moot.

"RDBMS provides far more opportunity for validation"

This can't be true. The application layer, which ultimately contains all 'knowledge' of all aspects of the business, including data from all other resources, can obviously 'provide more opportunity' for validation than any DB possibly can.

Moreover, 'validation' generally implies aspects which are inherently application specific; ergo, doing this purely in the data layer almost implies an intersection of concerns.

Validation in almost every case must be done on the app layer, so anything we get from the DB is an added benefit.

Also, data generally has to be validated when it enters the business logic, long before it gets into the DB. Moreover, there are usually data elements that are not persisted and must be validated anyhow, again illustrating the requirement for validation above the DB.

> The application layer, which ultimately contains all 'knowledge' of all aspects of the business

I've rarely seen a codebase outlive its database, but I constantly see databases survive through multiple codebases.

It's extremely common to validate a piece of data not on its own, but how it relates to other pieces of stored data. Without transactional semantics, an application basically can't enforce these invariants w/ any reliability (or those semantics need to be ensured out of band, or w/ little data modeling tricks that tend not to scale well).

There certainly are invariants that are non-trivial or cumbersome to enforce strictly with a schema, but you can really only enforce them w/ a database that provides serializable transactions.

In many cases, schematization of data in the database is good for other reasons though (for instance, guaranteeing type-normalized data in the presence of multiple deployed versions of an app via accident or otherwise, ensuring your queries and updates are typesafe, etc.)
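A small illustration of an invariant that depends on other stored rows, sketched with Python's built-in `sqlite3`. The `accounts` table and the "no balance may go negative" rule are invented for the example. Note the limits of the sketch: it only shows the atomic check-and-rollback on a single connection, and says nothing about how a particular database's isolation level behaves under concurrent writers.

```python
import sqlite3


def open_bank():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money atomically; hypothetical invariant: no balance goes negative."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # The check looks at stored data as a whole, not just the input.
            (lowest,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
            if lowest < 0:
                raise ValueError("transfer would overdraw an account")
        return True
    except ValueError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances exactly as they were, which is the part an application cannot easily guarantee on top of a store without transactions.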

First of all, we were talking about validation of data in the database, specifically.

> 'validation' generally implies aspects which are inherently application specific

Not at all. Taking this at face value implies that some app can write data to the database that is valid according to that app, and then another app can read data that is invalid from its perspective, and have to deal with it. That doesn't make sense - data is data, it's either valid, or it's not. That's why the schema is about the data, not about the app.

> Validation in almost every case must be done on the app layer

For UX reasons, mostly, yes. But it's usually much more basic than what e.g. triggers would do in the DB itself.

I'm not saying that there's nothing to validate outside of the DB, either. But for the data that is in the DB, the DB itself can usually do a better job.

> The application layer, which ultimately contains all 'knowledge' of all aspects of the business,

Data always outlives the application. You could argue that some app + data lives on together, but then you have just poorly reimplemented what an RDBMS does for you up front.

Databases are, as their name suggests, closest to the data.

Applications generally can't recreate ACID properties and specifically, they shouldn't be trying to.

"Applications generally can't recreate ACID properties" - why would they?

ACID and 'data validation' are generally separate issues.

Data generally has to be validated as it enters the business logic, before it gets stored in a DB. While a DB may in some cases ensure that data adheres to a schema, this usually does not fulfill all of the validation requirements.

Validation often requires examining a model beyond "is this an int?". That model needs to be self-consistent. That requires atomic movements from consistent state to consistent state.

You can do that yourself. Or let the database do it. For things where you can't express it in a database schema, sure. But you'd be surprised how far it gets you.

> my scheme is my app layer ... MongoDB by default is (was? It’s been a while since I’ve used it) schemeless

https://www.google.com/search?q=define%3Aschema

Maybe I’m a rare exception, but I chose NoSQL early on when it was still “hot” and have never looked back. We’ve grown from a couple megabytes of data to several dozen terabytes and have had countless issues, but scaling our database was never one of them.
It was just a discovery phase.
"my scheme is my app layer"

To be fair, you can use schema validators in Mongo. Not sure it's widespread in practice. And there are other distributed databases that aren't document stores that have schemas and various subsets of SQL implemented.