by csytan 2743 days ago
I think you're asking the wrong question. The question should be: How did MongoDB become so successful?

IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API. MongoDB was the natural choice because they excelled at being accessible to devs who were already familiar with Javascript and JSON.

Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?


I think you'll enjoy the series then, I spent several months investigating and made the same point about JSON and the Javascript-like CLI (plus great Node support, plus savvy marketing). For example:

> 10gen's key contributions to databases — and to our industry — was their laser focus on four critical things: onboarding, usability, libraries and support. For startup teams, these were important factors in choosing MongoDB — and a key reason for its powerful word of mouth.

Startup Engineers and Our Mistakes with MongoDB

https://www.nemil.com/mongo/2.html

The Marketing Behind MongoDB

https://www.nemil.com/mongo/3.html

Halfway into part two, this is very good so far. Thank you for the effort you put in (it really does show throughout).

> I think you're asking the wrong question. The question should be: How did MongoDB become so successful?

Marketing, marketing, and more marketing. Mongo was written by a couple of adtech guys.

> Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?

I remember being underwhelmed by two things at the one MongoConf I went to earlier this decade:

1.) My immediate boss was an unfathomable creep who was there mostly to pick up women

2.) Mongo was focused on how to work around the problems (e.g. aggregate framework) rather than how to solve them.

I can't recall ever seeing a Postgres rep, but I can recall having worked out a PostGIS bug with a fantastically tight feedback loop. The Postgres documentation and community are nothing short of amazing.

Meanwhile with Mongo I watched as jawdropping bugs languished. IDGAF what the reps say, anyone with even a few years experience should've been able to see through the bullshit that Mongo/10gen was/is selling.

> IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API.

The thing I dislike about this type of comment – although I now notice yours doesn't explicitly say this – is the implication that devs don't like SQL because they're lazy or stupid. Well, sometimes that is probably true! But there are some tasks where you need to build the query dynamically at run time, and for those tasks MongoDB's usual query API, or especially its aggregation pipeline API, are genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties. For anyone not familiar, just look at how close to being a native Python API pymongo is:

    from bson.son import SON  # ordered mapping, so the sort keys keep their order

    # Count how often each tag occurs, most frequent first.
    pipeline = [
        {"$unwind": "$tags"},
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
        {"$sort": SON([("count", -1), ("_id", -1)])}
    ]
    # `db` is assumed to be an existing pymongo Database handle.
    result_cursor = db.things.aggregate(pipeline)

Of course you could write an SQL query that does this particular job and is probably clearer. But if you need to compose a bunch of operations arbitrarily at runtime then using dicts and lists like this is clearly better.

Of course pipelines like this will typically be slow as hell because arbitrary queries, by their nature, cannot take advantage of indices. But sometimes that's OK. We do this in one of our products and it works great.
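
As a rough sketch of the runtime composition described above, the helper below assembles a pipeline from whichever options the caller supplies. `build_pipeline` and its parameters are hypothetical names invented for illustration, not part of pymongo; only the stage operators (`$unwind`, `$match`, `$group`, `$sort`, `$limit`) are MongoDB's own.

```python
def build_pipeline(unwind=None, match=None, group_by=None, sort_desc=None, limit=None):
    """Assemble an aggregation pipeline from whichever options were supplied."""
    stages = []
    if unwind:
        stages.append({"$unwind": f"${unwind}"})
    if match:
        stages.append({"$match": dict(match)})
    if group_by:
        # Count the documents that fall into each group.
        stages.append({"$group": {"_id": f"${group_by}", "count": {"$sum": 1}}})
    if sort_desc:
        stages.append({"$sort": {sort_desc: -1}})
    if limit:
        stages.append({"$limit": int(limit)})
    return stages


pipeline = build_pipeline(unwind="tags", group_by="tags", sort_desc="count", limit=10)
# With a live connection this would be passed to db.things.aggregate(pipeline).
```

Because the pipeline is just a list of dicts, optional stages are added with `append`, and there are no separators or quoting to get wrong.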

With JSONB and replication enhancements, Postgres is close to wiping out all of MongoDB's advantages. I would love to see a more native-like API like Mongo's aggregation pipeline, even if it's just a wrapper for composing SQL strings. I think that would finish off the job.

Elixir's primary database wrapper, Ecto [0], lets you dynamically build queries at runtime, and also isn't an ORM. Here are two examples directly from the docs:

  import Ecto.Query

  # Query all rows in the "users" table, filtering for users whose age is > 18, and selecting their name
  "users"
  |> where([u], u.age > 18)
  |> select([u], u.name)

  # Build a dynamic query fragment based on some parameters
  dynamic = false
  
  dynamic =
    if params["is_public"] do
      dynamic([p], p.is_public or ^dynamic)
    else
      dynamic
    end
  
  dynamic =
    if params["allow_reviewers"] do
      dynamic([p, a], a.reviewer == true or ^dynamic)
    else
      dynamic
    end
  
  from "posts", where: ^dynamic

Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.

My main point, though, is that you don't need to reach for NoSQL if all you need is a way to compose queries without string interpolation.

[0] https://github.com/elixir-ecto/ecto

As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline unless it can do analogous things to Postgres's JSONB fields. For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?

Also, one of the benefits of Mongo's API is that it has excellent native implementations in numerous languages (we already use C++ and Python), so a suggestion to switch language entirely is not really equivalent.

> As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline

Huh? The aggregation framework is a solution to a mongo-only problem. Most other databases are performant, but Mongo suffers wildly from coarse locking and slow performance putting things into and retrieving things from the javascript VM.

> For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?

This sounds suspiciously like a SQL view.

Edit: But if you actually need an array in a cell, Postgres has an array type that's also a first-class citizen with plenty of tooling around it.

The "this" was referring to dynamically building queries (the GP comment by me) in Ecto (the parent comment by QuinnWilton). What you've said is a non-sequitur in the context of this little discussion. My whole original point is that raw SQL isn't right in all situations, and you appear to be arguing that I just use SQL instead.

I can't speak to every ORM or database interface in existence, but ActiveRecord will happily handle Postgres arrays and let you use the built-in array functions without having to write queries by hand. Ecto is less elegant, but you can still finagle some arrays with it.

As far as views are concerned, I don't know what to tell you. Sure, you'll probably have to craft the view itself by hand. The result is that you can then use most abstractions of your choosing on top of it though.

>For example, can it...

Yes. There will be a subquery, and the jsonb indexes need to be thought out in order to make it fast.

> Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.

Ahh Elixir. My favorite language that really just tries so hard to shoot itself in the foot. I'm currently in the protracted process of trying to upgrade a Phoenix app to the current versions. Currently I'm at the "rewrite it in Rust and try out Rocket + Diesel" stage.

Diesel is... interesting and makes me long for Ecto (which is often used as an ORM although the model bits got split off into a different project).

Love the downvotes instead of comments. I've walked away from Elixir as the best practice deployment methodology (Distillery) is non-op on FreeBSD[1] and has been for a few months while the Distillery author is mum. All of this despite the vast love that the Elixir community seems to heap on FreeBSD.

Erlang and Elixir have plenty of promise but there simply is no good story for production deployments. Distillery and edeliver approximate capistrano, and that sounds great when it works (although I'd just as soon skip edeliver). But when it doesn't I'd much rather dig into the mess of ruby that is Capistrano than the mess of shell scripts, erlang, and god knows what else goes into a Distillery release.

Elixir is a really interesting language, but Phoenix seems to still be pretty wet behind the ears and very much in flux. Ecto too to a much smaller extent.

1: Some of the distillery scripts can communicate with epmd, some just give up.

Well... you can also use a modern ORM. I think "stitching ... text strings" is definitely not the way to go when interfacing with a SQL database. My go-to ORM is Sequel[1]. I think their API is one of the best I've seen: you can choose to use models, but you can also work directly with "datasets" (tables or views, or queries) and compose them as you like. It's really powerful and simple.

[1]: http://sequel.jeremyevans.net/

> genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties.

You're using the Pymongo library as an example. Someone can just as easily use SQLAlchemy and not have to worry about those things.

I'm confused by the implication that someone doing things like the above would be writing in SQL. SQL is a little like assembly language in a game: You may need to drop down to it for some key highly-optimized areas, but you rarely need to directly use it for most tasks. While it's true that you should understand how it works so you don't generate queries that suck performance-wise, the same goes for Mongo's intricacies too.

Every language I know of has great ORMs which do this for whatever SQL flavors people tend to use on that platform. I write things like this all the time, and it gets turned into SQL for Postgres:

    Article.where(author_id: 37).order(modified_date: :desc).where.not(published: false)

When using an ORM correctly (and indeed, the less I'm using any of my own bits of SQL the more this is true) I am also protected against injection attacks.
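
A minimal sketch of why that protection holds, assuming Python's built-in `sqlite3` as a stand-in for Postgres: values travel as bound parameters rather than being pasted into the SQL text, which is roughly what query builders do on your behalf. The `articles` table, `ALLOWED_COLUMNS`, `build_query` and `run` are hypothetical names made up for this example.

```python
import sqlite3

# Hypothetical whitelist: column names cannot be bound as parameters,
# so they are checked against a fixed set instead.
ALLOWED_COLUMNS = {"author_id", "published"}


def build_query(filters):
    """Return (sql, params) for an equality filter on each given column."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unexpected column: {column}")
        clauses.append(f"{column} = ?")  # the value travels separately
        params.append(value)
    sql = "SELECT id FROM articles"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)  # join() handles the separators
    return sql + " ORDER BY id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, published INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)", [(1, 37, 1), (2, 37, 0), (3, 5, 1)]
)


def run(filters):
    sql, params = build_query(filters)
    return [row[0] for row in conn.execute(sql, params)]


published_by_37 = run({"author_id": 37, "published": 1})
# The malicious value is compared as one plain string, never parsed as SQL.
injection_attempt = run({"author_id": "37 OR 1=1"})
```

The same shape also sidesteps the separator problem: clauses are collected in a list and joined once, so no clause needs to know whether it is first or last.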

I'm not saying NoSQL has no value, but I believe it to be the wrong tool for data that lends itself to an RDBMS. If you have a bunch of documents that have deeply nested or inconsistent structures and where it makes no sense that you'd want to query by something other than the primary key, sure, it's a no-brainer to use a NoSQL system. For a CMS, which has been implemented thousands of times in RDBMSs, it is madness though. I cringe at realizing that apparently there are developers out there who have avoided learning SQL entirely in their career out of fear, and as a result have to use Mongo for every application because that's the only thing they know how to do. I'm sure they're out there, but I wouldn't hire one.

I’ll try to avoid a flame war, but since you’re using Python, SQLAlchemy allows for composing SQL strings.

Yes, an ORM, not SQL itself.

You can compose queries using the query API in SQLAlchemy, without touching the ORM.

SQLAlchemy is not an ORM. There’s a companion ORM project if you want it, but it’s not necessary.

Yes? What's the qualitative difference between using a best-of-breed SQL ORM and the Mongo API?

For it to replace MongoDB's aggregation pipeline, it would need to play nicely with JSONB. Does it do that? This is the thing I'm really missing.

For example, if documents in the JSONB column all look roughly like this:

    {
        "someArrayField": [
            { "key": "steve", "value": 7 },
            { "key": "bob", "value": 15 }
        ],
        "someOtherField": [ "whatever" ]
    }

* Can I count the number of entries in someArrayField, summed across all records?

* Can I get the per-record mean of the "value" sub-field, summed across all records?

* Can I filter by records that have a "someArrayField" entry where "key" is "steve" and "value" is at least 10 (the above record should NOT match)?

Yes, you can!

The jsonb_array_elements function is roughly similar to Mongo’s $unwind pipeline op. It explodes a JSON array into a set of rows. From there it’s pretty simple aggregates to achieve what you’re looking for.

I was evaluating Mongo a couple months back to solve roughly the same problems. Eventually discovered Postgres already had what I was looking for.
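
To make the unwind-then-aggregate shape concrete, here is a runnable sketch of the three questions asked above. It is an approximation under stated assumptions: it uses SQLite's `json_each` from Python's built-in `sqlite3` as a stand-in for Postgres's `jsonb_array_elements`, so it runs without a server, and it assumes the SQLite build includes the JSON functions. The `things` table and the second sample document are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, doc TEXT)")
docs = [
    {"someArrayField": [{"key": "steve", "value": 7}, {"key": "bob", "value": 15}],
     "someOtherField": ["whatever"]},
    {"someArrayField": [{"key": "steve", "value": 12}], "someOtherField": []},
]
conn.executemany("INSERT INTO things (doc) VALUES (?)", [(json.dumps(d),) for d in docs])

# 1. Number of someArrayField entries, summed across all records.
#    json_each explodes the array into one row per element (like $unwind).
total_entries = conn.execute(
    "SELECT count(*) FROM things, json_each(things.doc, '$.someArrayField')"
).fetchone()[0]

# 2. Per-record mean of "value", then summed across all records.
mean_sum = conn.execute(
    """
    SELECT sum(mean) FROM (
        SELECT things.id, avg(json_extract(elem.value, '$.value')) AS mean
        FROM things, json_each(things.doc, '$.someArrayField') AS elem
        GROUP BY things.id
    )
    """
).fetchone()[0]

# 3. Records having an entry with key "steve" AND value >= 10 in the same element.
matching_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT things.id FROM things
        WHERE EXISTS (
            SELECT 1 FROM json_each(things.doc, '$.someArrayField') AS elem
            WHERE json_extract(elem.value, '$.key') = 'steve'
              AND json_extract(elem.value, '$.value') >= 10
        )
        ORDER BY things.id
        """
    )
]
```

The first record does not match the third query, because its "steve" entry has a value of 7. In Postgres the same structure would use `jsonb_array_elements(doc->'someArrayField')` in place of `json_each`, with `->>` and a cast in place of `json_extract`.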

More to the point, Postgres has an actual array data type (and has for a while). You don't need to shove everything into a JSON/JSONB blob unless you absolutely cannot have any sort of schema.

Not only arrays: you can, with some limitations, create proper types with field names. If your ORM supports that, you should use it over JSONB where it fits.

Allow me to restate this question:

> Does it do that?

It was supposed to be clear from the context that this meant:

> Does building queries programmatically with SQLAlchemy do that?

Maybe I'm misreading your comment, but you seem to just be talking about writing queries directly in SQL.

If not, could you give an example/link of how to programmatically build a query in SQLAlchemy that dynamically makes use of jsonb_array_elements? It would be hugely useful if I could do that.

There are some old examples of how to use jsonb_array_elements in SQLAlchemy here: https://github.com/sqlalchemy/sqlalchemy/issues/3566#issueco...

I was speaking of SQL, but if you can write it in SQL you can usually map it to SQLAlchemy. If worse comes to worst, you can use text() to drop down to raw SQL for just a portion of the query.

SQLAlchemy’s Postgres JSONB type allows subscription, so you can do Model.col['arrayfield']. You can also manually invoke the operator with Model.col.op('->')('arrayfield').

So you should be able to do something like:

    func.sum(func.jsonb_array_elements(Model.col.op('->')('arrayfield')).op('->')('val'))

(Writing on my mobile without reference, so may not be fully accurate)

It absolutely can, but in my experience, 99% of the time, choosing to make a data field JSON/JSONB ends up being a mistake.

To add to the pile of responses: in Scala, Slick is a great library that lets you compose SQL queries and fragments of queries quite effectively. (http://slick.lightbend.com/)

At my company we built a UI on top of Slick that lets users of our web app define complex triggers based on dynamic fields and conditions which are translated to type-safe SQL queries.

QueryDSL and jOOQ as well for Java.

From my POV the rise of 'NoSQL' some years back was tied into a number of things:

- Misunderstanding by most developers of the relational model (I heard a lot of blathering about 'tabular data', which is missing the point entirely).

- The awkwardness and mismatchiness of object-relational mappers -- and the insistence of most web frameworks on object-oriented modeling.

- The fact that Amazon & Google etc. make/made heavy use of distributed key-value stores with relatively unstructured data in order to scale -- and everyone seemed to think they needed to scale at that level. (Worth pointing out that since then Google & Amazon have been able to roll out data stores that scale but use something closer to the relational model). This despite the fact that many of the hip NoSQL solutions didn't even have a reasonable distribution story.

- Simple trending. NoSQL was cool. Mongo had a 'cool' sheen by nature of the demographic that was working there, the marketing of the company itself.

I remember going to a Mongo meet-up in NYC back in 2010 or so, because some people in the company I was at at the time (ad-tech) were interested in it. We walked away skeptical and convinced it was more cargo-cult than solution.

I'm _very_ glad the pendulum is swinging back and that Postgres (which I've pretty much always been an advocate of in my 15-20 year career) is now seeing something of a surge of use.

> When was the last time you saw a Postgres rep at a college tech event?

Is a Postgres rep a thing?

I remember a hyperbolic readme or other such txt file for Postgres in the far-away long-ago time when everyone was on Slashdot. The author had written one of the most enthusiastic lovenotes to software I'd ever read, and that includes Stephenson's "In The Beginning Was The Commandline." It was a Thomas Wolfe level of ejaculatory keenness. I'd love to read it again if anyone else knows where I can find the file. So, even if there aren't actual Postgres reps, there are most assuredly evangelists.

That's the point.

Absolutely - if you don't know SQL and you do know JSON, Postgres looks scary and Mongo looks familiar.

Saying "I don't know SQL so I will just use JSON" really misses the point though. SQL is easy. Data is hard. NoSQL products offer to get rid of SQL which includes an implication that SQL itself was the challenge in the first place. The problem then is that you have lost one of the best tools for working with data.

I dunno that SQL is exactly easy, though. It's one thing to say "select statements are essentially identical to Python list comprehensions", but in practice I still have to look up the Venn diagram chart every time I need to join anything, and performance optimization is still a dark art. I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.

You could solve that by altogether dropping the Venn diagram metaphor when reasoning about joins. This is the number one problem I see with junior devs who have a hard time grokking SQL. If you think about a join as a cartesian product with a filter, where the type of join defines the type of filter, the reasoning is extremely easy.

Here's a good article about that: https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-whe...
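
The "cartesian product with a filter" view can be sketched in a few lines of plain Python. This is only a mental model, not how a database engine actually executes joins, and the `authors` and `posts` rows are made-up sample data.

```python
from itertools import product

# Hypothetical sample rows: (author_id, name) and (post_id, author_id, title).
authors = [(1, "ann"), (2, "bob"), (3, "cy")]
posts = [(10, 1, "intro"), (11, 1, "joins"), (12, 2, "indexes")]


def inner_join(left, right, on):
    """Every pair from the cartesian product that passes the predicate."""
    return [(l, r) for l, r in product(left, right) if on(l, r)]


def left_join(left, right, on):
    """Same filter, but a left row with no match is kept, paired with None."""
    rows = []
    for l in left:
        matches = [(l, r) for r in right if on(l, r)]
        rows.extend(matches or [(l, None)])
    return rows


def same_author(author, post):
    return author[0] == post[1]


inner = inner_join(authors, posts, same_author)
left = left_join(authors, posts, same_author)
```

Here `inner` drops "cy", who has no posts, while `left` keeps that row with `None` on the right-hand side. The join type only changes what happens to the pairs the filter rejects.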

The hard parts of "SQL" are the hard parts of data. Joins aren't easier in Mongo. The performance optimizations you reference are tuning of a relational database, not SQL itself.

If you want to work with databases a domain specific language like SQL really provides a lot of value in solving these hard data problems.

> performance optimization is still a dark art

The idea is, in relational databases, that the vast majority of the time you shouldn't have to do it. Because you're writing your queries in a higher level (nay, functional) language, the query planner can understand a lot more about what you're trying to do and actually choose algorithms and implementations that are appropriate for the shape and size of your data. And in 6 months time when your tables are ten times the size, it is able to automatically make new decisions.

More explicit forms of expressing queries have no hope of being able to do this and any performance optimization you do is appropriate only for now and this current dataset.
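
A small illustration of the planner making that choice for you, assuming SQLite (via Python's built-in `sqlite3`) as a stand-in; Postgres's planner is far more sophisticated and also reacts to table statistics. The `readings` table and index name are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, reading) VALUES (?, ?)",
    [(f"sensor-{i % 50}", float(i)) for i in range(5000)],
)

# The query text never changes; only the physical design around it does.
QUERY = "SELECT * FROM readings WHERE sensor = ?"


def plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column is the detail text


plan_before = plan(QUERY, ("sensor-7",))  # full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan_after = plan(QUERY, ("sensor-7",))   # now an index search

row_count = len(conn.execute(QUERY, ("sensor-7",)).fetchall())
```

The application code that issues `QUERY` is untouched, yet the second plan uses the new index. With a hand-written access path, that change would have meant rewriting the query logic.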

> I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.

Mongo and Javascript don't solve that either. In fact you get additional problems by virtue of not being able to do a variety of joins. For extra points, you're going to need to go well beyond javascript with mongo if you want performance. 10gen invented this whole "aggregation framework" to sidestep the performance penalty that javascript brings to the table.

On the other side, the postgresql documentation is second to none. SQL isn't necessarily easy but the postgres documentation gives you an excellent starting point.

Here is the dirty secret of Mongo: you always have a schema, you just don’t have any tools for validating it or enforcing it or manipulating it.

You make it sound like learning SQL is like learning Assembler. It's not that hard. And ORMs exist in every language to abstract it all away.

PostgreSQL looks scary because it is a swiss army knife. It has a million different features and data structures. MongoDB does only one thing.

> You make it sound like learning SQL is like learning Assembler

It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."

You seem like the kind of person ready and willing to learn the right tool for the job. From my experience a few years ago on an accredited computing course that covered database admin and programming, this attitude is not representative of most of the software engineering students *unless* there's a specific assignment that requires particular knowledge.

Cs get degrees. And for plenty of developers out there, knowing one language (not even particularly well) gets jobs.

> It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."

And that's a big fat mistake. There are so many ways to shoot yourself in the foot with mongo such that simply knowing the language mongo uses for most of its queries while not actually knowing the particulars of how mongo uses that language… well that's just a road to a world of hurt.

For example, when I first inherited a mongo deployment I noticed the queries were painfully slow. Ah hah says me, let's index some shit. Guess what? Creating an index on a running system with that version of mongo = segfault.

After a bunch of hair pulling I got mongo up and running and got the data indexed. But the map reduce job was STILL running so slowly that we couldn't ingest data from a few tens of sensors in real time. So I made sure to set up queues locally on the sensors to buy myself some time.

Even in my little test environment with nothing else hitting the mongo server, mongod was still completely unable to run its map reduce nonsense in a performant manner. Mongo wisdom was: shard it! wait for our magical aggregation framework! Here's the thing: working at a dinky startup we can't afford to throw hardware at it especially that early in the game. Sharding the damn thing would also bring in mongo's inflexible and somewhat magical and unreliable sharding doohickey.

So I thought back to previous experience with time series data. BTDT with MySQL, you're just trading one awful lock (javascript vm) for another (auto increment). So I set up a test rig with postgres. Bam. I was able to ingest the data around 18x faster.

And that's the thing. Mongo appeals to people who are comfortable with javascript and resistant to learning domain specific knowledge. All that appealing javascript goodness comes with a gigantic cost. If you're blindly following the path of least resistance you're in for a bad time.

P.S. plv8 is a thing, and you can script postgres in javascript if you really wanted to.

I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline. So while "sure I'll just learn SQL" is great for a personal or school project, when you've got to get something done next week, it's better to take maximal advantage of the tools/skills/workflow that you already have.

IOW, it's not just laziness, it's a kind of professional conservatism, which is partly what gets older engineers stuck in a particular mindset, but it's also a very effective learned skill. The opposite is being a magpie developer, which results in things like MongoDB taking off :)

> I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline.

You have to do the exact same things with Mongo+JS (e.g. learning when to avoid the JS bits like the plague).

> "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language,

SQL is a skill that rewards investment in it 1000x over, in terms of longevity. It has spanned people’s entire careers! What’s the shelf life of the latest JS framework, 18 months at most...

Yes, I know that, and that's why I know and use SQL instead of MongoDB. But that's a very similar reason to why I've resisted learning Rust, and Ruby, and React, and Docker, and Scala, and many more. I know I could learn the utter basics in a weekend, but I also know that those basics are utterly useless in a real-world context, and I would prefer to spend the weekend hacking on my open-source project in Python or C, which I've already invested the years into. And that's how engineers age into irrelevance...

Well, that and SQL has a somewhat undeserved reputation for being easy to learn, but also easy to screw up. Like you write a simple looking query and it turns out to have O(n^2) complexity and your system ends up bogged down in the database forever.

In practice people who fall into complexity traps are usually asking a lot more of their database engine than any beginner. It's usually not that hard to figure out the approximate cost of a particular query.

> Like you write a simple looking query and it turns out to have O(n^2) complexity

Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.

[Not looking at MySQL in particular, oh no.]

Learning SQL syntax isn't hard, but learning how to properly design relational databases is not something you can pick up from skimming blog posts.

I'd say that this goes for databases in general, not just relational ones.

Which brings us back to the original point: data is hard.

Assembly language isn’t hard. It’s actually quite simple. The issue with learning assembly is the domain knowledge it requires, which is absolutely useless otherwise.

Document-based storage definitely fits the generalised use-case better than tabular storage.

> Not only that, their marketing/outreach efforts were also aimed at younger developers.

I do remember a lot of MongoDB t-shirts, cups and pens around every office I was in around 2011-2013. When I would ask they would tell me that a MongoDB developer flew halfway across the world to give them all a workshop on it.

> The question should be: How did MongoDB become so successful?

The ability to store Algebraic Data Types and values with lists without the hassle of creating a ton of tables and JOINs. Postgres has since added JSON support, plus there are now things like TimescaleDB, which didn't exist previously.

None of this makes sense.

ORMs have existed for decades so developers can use a SQL database just fine without knowing the language. So it's definitely not this.

It's more likely because Mongo (a) is extremely fast, (b) is the easiest database to manage and (c) has a flexible schema which aligns better with dynamic languages, which are more popular amongst younger developers.

Postgres is faster at JSON than Mongo. Also, the pipeline query strategy of Mongo is terrible to deal with. A schema should not be flexible. Now I have to write a bunch of code to handle things that should have been enforced by the database. Postgres is incredibly easy to manage, with actual default security. I know the Mongo tutorial says not to run the default configuration, so why is it the default configuration? It's so easy to manage that anyone can take it over for ransom.

Mongo literally has no upside vs postgres.
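
For a sense of what that "bunch of code" tends to look like, here is a hand-rolled check of the kind an application ends up carrying when the database does not enforce structure. It is a hypothetical sketch: `validate_thing` and the document shape (borrowed from the `someArrayField` example earlier in the thread) are illustrative, not taken from any real codebase.

```python
def validate_thing(doc):
    """Return a list of problems; an empty list means the document passed."""
    entries = doc.get("someArrayField")
    if not isinstance(entries, list):
        return ["someArrayField must be a list"]
    errors = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"someArrayField[{i}] must be an object")
            continue
        if not isinstance(entry.get("key"), str):
            errors.append(f"someArrayField[{i}].key must be a string")
        value = entry.get("value")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"someArrayField[{i}].value must be a number")
    return errors


good = {"someArrayField": [{"key": "steve", "value": 7}]}
bad = {"someArrayField": [{"key": "steve", "value": "7"}, "oops"]}
good_errors = validate_thing(good)
bad_errors = validate_thing(bad)
```

In a relational schema most of these checks would be column types and constraints, declared once and applied to every writer.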

To use an ORM and not get crap performance you still need to understand sql, and what is happening under the hood.