Hacker News
by towelrod 4597 days ago
I think you are missing the point of the article. If you read down to the Epilogue it explains how the "perfect" application still didn't work with MongoDB once the clients started asking for more features.

My read was that even when you think you don't have "graph like relationships" in your data, you actually do.

The original author did say this, but I would like to add: if you don't have "graph like relationships", then your data is pretty trivial and any data store will do.

From another comment I made, on why I don't think this is a good article, even taking the proposed thesis of "mongo doesn't work for graph like relationships" at face value:

Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or, on top of doing no research, did you not read the mongo documentation? (Write concern has been there since at least 2.2.)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when what I mainly see is: look at some document, then retrieve the users whose ids match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas I can't really tell why it didn't work for them.
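To make the "look at some document, retrieve users that match an array" pattern concrete, here is a minimal sketch. It uses an in-memory Map as a stand-in for a users collection indexed by id; the field names (`items`, `userId`, `avatar`) are my own assumptions, since the article never posted its schemas. Against a real MongoDB, step 2 would typically be a single `$in` query on `_id`.

```javascript
// Hypothetical stand-in for a "users" collection, keyed by _id
// (the Map lookup plays the role of the _id index).
const usersById = new Map([
  [1, { _id: 1, name: "alice", avatar: "/avatars/1.png" }],
  [2, { _id: 2, name: "bob", avatar: "/avatars/2.png" }],
]);

// Hypothetical stream document: it stores user ids, not user details.
const stream = {
  _id: "stream-1",
  items: [
    { userId: 2, text: "posted a photo" },
    { userId: 1, text: "liked a post" },
    { userId: 2, text: "commented" },
  ],
};

// Step 1 is "retrieve the stream document" (already in hand above).
// Step 2: collect the distinct user ids, look each one up once,
// then fill in names and avatars for every item.
function renderStream(streamDoc, users) {
  const ids = [...new Set(streamDoc.items.map((item) => item.userId))];
  const found = new Map(ids.map((id) => [id, users.get(id)]));
  return streamDoc.items.map((item) => {
    // Fall back to a placeholder if a user document is missing.
    const user = found.get(item.userId) || { name: "(unknown)", avatar: null };
    return { name: user.name, avatar: user.avatar, text: item.text };
  });
}

const rendered = renderStream(stream, usersById);
```

Note that each distinct user is looked up once, no matter how many stream items mention them.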

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".

I'm pretty ignorant of MongoDB so I'm genuinely interested in your response: How would you solve the problem in the epilogue, namely "a chronological listing of all of the episodes of all the different shows that actor had ever been in"?

Did Sarah model the data poorly ("We stored each show as a document in MongoDB containing all of its nested information, including cast members").

Or is there an easy way to extract that information that Sarah just doesn't know about yet?

Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

The last part seems like a really straightforward relational critique to me. If you don't break the actors out into unique entities then you can't compare them across shows. But if you do break them out into unique entities, then how do you present the show information without doing joins?

  > Did Sarah model the data poorly ("We stored each show as a 
  > document in MongoDB containing all of its nested 
  > information, including cast members").

Yes, they modeled the data poorly.

In this example, we have a TV Show, which is modeled as an entity (document). This TV Show has a list of cast members, each one modeled by a nested object.

In a relational database, this type of relationship would be modeled by having a TV_SHOWS table, a CAST_MEMBERS table with a foreign key to the TV_SHOWS table, and a CASCADE DELETE relationship to ensure that if a TV_SHOW is deleted, the related CAST_MEMBER records are also deleted.

This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS. (In OO we'd call this a "component" relationship, that is, we're saying that a tv show is composed of cast members, and if we destroy the tv show we destroy the cast members as well.)

They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

  > But if you do break them out into unique entities, then 
  > how do you present the show information without doing 
  > joins?

You must join, albeit in MongoDB you do this in the application layer, not the database, so:

1. Query the cast members collection to find the cast member id.
2. Query the tv shows collection to find all tv shows with cast member id in the cast members set.
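The two steps can be sketched as follows. This is an illustration only: plain arrays stand in for the two collections, and the names `castMembers`, `tvShows`, and `castMemberIds` are hypothetical, not taken from the article.

```javascript
// Hypothetical stand-in for the cast members collection.
const castMembers = [
  { _id: 10, name: "Actor A" },
  { _id: 11, name: "Actor B" },
];

// Hypothetical stand-in for the tv shows collection.
// Each show stores a list of cast member ids, not nested cast documents.
const tvShows = [
  { _id: 1, title: "Show One", castMemberIds: [10, 11] },
  { _id: 2, title: "Show Two", castMemberIds: [11] },
  { _id: 3, title: "Show Three", castMemberIds: [10] },
];

// Application-layer join: two sequential lookups.
function findShowsForCastMember(name) {
  // 1. Find the cast member id.
  const member = castMembers.find((m) => m.name === name);
  if (!member) return [];
  // 2. Find every show whose cast list contains that id.
  return tvShows
    .filter((show) => show.castMemberIds.includes(member._id))
    .map((show) => show.title);
}

const showsForB = findShowsForCastMember("Actor B");
```

Because every cast member exists exactly once, the same person is recognizably the same across shows, which the embedded model could not guarantee.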

Those of us who cut our teeth on relational databases have trouble seeing past "two trips to the database" in the above strategy, and that's probably why there's an urge to embed documents rather than to query two collections sequentially. Resist this urge, as it's as bad as the urge to denormalize, i.e. there'd better be a damn good reason to do it.

> This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS.

... huh?

> They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

So instead of a one-to-many relationship, they should use a one-to-many relationship expressed in a different notation?

MongoDB doesn't forbid you from having entities and relations. It just doesn't support them in the same way that SQL databases do. Ditto for CouchDB, etc.

You end up having to do some joins yourself still, but this is often appropriate. Imagine that the "actor" entity contains a complete bio, including family history with relationship to other actors, links to wikipedia & fan sites, etc. When you're displaying the page for episode #202 of "Everyone Loves MongoDB", you don't want to retrieve all that data for all the actors. You're not going to display it all on the episode page anyway. Instead, you just need an ID (to href an a and src an img) and probably a small amount of denormalized stuff (name, for the img alt ...). Since that's what you need, that's what you store.
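A rough sketch of "store what you need": the episode keeps only an id and a name per actor, and the page is rendered without touching the full actor documents. All field names and URL paths here are made up for illustration.

```javascript
// Full actor document (hypothetical fields), stored in its own collection.
const actor = {
  _id: 7,
  name: "Actor A",
  bio: "A long biography that the episode page never shows.",
  fanSites: ["https://example.com/fans"],
};

// Keep only what the episode page needs: the id and a denormalized name.
function toCastEntry(fullActor) {
  return { actorId: fullActor._id, name: fullActor.name };
}

// Hypothetical episode document with the small denormalized cast list.
const episode = {
  _id: 202,
  show: "Everyone Loves MongoDB",
  cast: [toCastEntry(actor)],
};

// Render cast links from the episode alone; no actor lookup is needed.
// (Sketch only: names are not HTML-escaped here.)
function castLinks(ep) {
  return ep.cast.map(
    (c) =>
      `<a href="/actors/${c.actorId}"><img src="/actors/${c.actorId}/photo.jpg" alt="${c.name}"></a>`
  );
}

const links = castLinks(episode);
```

The trade-off is that a denormalized name has to be updated in every episode that copies it if the actor's name ever changes.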

There's a limit to how far you can denormalize schemas before it is no longer helpful. The author explores this limit, and finds that MongoDB doesn't make the limit go away.

You're basically saying: don't use mongo. It's trivial to emulate a blob of data in a relational database; just use a... blob of data. Or any of the many, many other options at your fingertips. Conversely, manually implementing efficient joins is a total hassle and it'll probably end up slow and brittle. At the very least you'll need indexes and that means an (implied) schema.

So in the normal mongo usecase for storing (as opposed to caching) data with relations, let me see if I can summarize:

- you can have relations, it's just mongo won't help you deal with them; you'll need to implement them yourself.

- you can have (actually need) a schema, it's just mongo won't help you deal with that; you'll need to implement that yourself. Have lots of fun with schema-changes, especially because...

- Since you're changing decoupled entities, you need to keep them in sync. You can (and probably should) use transactions, but mongo won't help you with that. You also probably want foreign keys, but mongo won't help you with that either. Migrations on mongo are a special kind of terrifying.

But hey, on the upside, it can store structured blobs, and it's probably hardly any slower than your filesystem, which could do that too.

You could absolutely do the same thing with Postgres (or SQL Server) and computed indexes over JSON (or XML) blobs. Of course, then you'd have exactly the same schema migration issues.

My point was more that a lot of the time, if you structure your data right (and get the right balance of denormalization) you don't need joins very much and so the lack of them isn't really a big disadvantage.

> Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

As others have pointed out, it requires two trips to the database. Given their architecture (distributed nodes), network latency is minimal, so this is essentially two calls to the database.

show { _id, title }

actor { _id, appearedIn : [id] }

db.show.find({"title": "awesomeshow"}, {"_id": 1})

db.actor.find({"appearedIn": showId})

Each actor is unique in the database, so when you query, you get back unique actors. I'm not sure why they're scared of joins (or multiple queries in mongo).
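For what it's worth, here is how the epilogue's "chronological listing" might look on top of that show/actor layout. This is a sketch with in-memory arrays standing in for collections. The `episodes` collection and its `showId`, `title`, and `airDate` fields are my own assumptions and are not part of the schema above.

```javascript
// Stand-ins for the show and actor collections from the schema above.
const shows = [
  { _id: 1, title: "awesomeshow" },
  { _id: 2, title: "othershow" },
  { _id: 3, title: "thirdshow" },
];
const actors = [
  { _id: 100, name: "Actor A", appearedIn: [1, 2] },
  { _id: 101, name: "Actor B", appearedIn: [1] },
];

// Hypothetical episodes collection; ISO dates sort correctly as strings.
const episodes = [
  { showId: 1, title: "Pilot", airDate: "2001-01-10" },
  { showId: 2, title: "Opening", airDate: "2000-05-01" },
  { showId: 1, title: "Second", airDate: "2001-01-17" },
  { showId: 3, title: "Unrelated", airDate: "1999-01-01" },
];

// The two queries from the comment: show id by title, then actors by show id.
function actorsInShow(title) {
  const show = shows.find((s) => s.title === title);
  if (!show) return [];
  return actors.filter((a) => a.appearedIn.includes(show._id)).map((a) => a.name);
}

// Epilogue feature: every episode of every show the actor has been in,
// ordered by air date.
function episodesForActor(actorId) {
  const actor = actors.find((a) => a._id === actorId);
  if (!actor) return [];
  const titleById = new Map(shows.map((s) => [s._id, s.title]));
  return episodes
    .filter((e) => actor.appearedIn.includes(e.showId)) // filter() copies, so sort() is safe
    .sort((a, b) => (a.airDate < b.airDate ? -1 : a.airDate > b.airDate ? 1 : 0))
    .map((e) => `${e.airDate} ${titleById.get(e.showId)}: ${e.title}`);
}

const listing = episodesForActor(100);
```

That is three lookups instead of one, and it only works because actors are their own documents rather than names nested inside each show.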

The question you ask yourself is not whether you're joining, but how often you're joining. If you're not joining often on actors and shows, document databases can work better, since you represent the show and all its episodes without having to join.

Another "issue" occurs to me. It seems likely that the data coming in about TV shows, especially old ones with decades of episodes, would be a bit "dirty". This sort of thing just slides right into a document store, but a relational one would have some problems with that. How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors? Of course these things can be fixed with enough manual (or, even better, user) intervention, but the time and place for that are after you've got the data in the system, not before.

> How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors?

In the USA, the various professional creative guilds enforce uniqueness.

Your general musing is right, but the problem of source-data quality is generally considered to be distinct from the design of schemata.

Yeah, the comment on graph databases seemed a bit too flippant.