Hacker News new | ask | show | jobs
by takeda 3698 days ago
Because NoSQL is a hype.

Many of the NoSQL essentially takes us back to 60s before Codd came up with relational model[1] These ideas tend to come back once in a while [2][3], but so far nothing is better than relational model.

NoSQL still makes sense in many cases (generally when your specific use case does not need all guarantees of ACID), you can get in return higher performance or horizontal scalability.

MongoDB is aim to be generic database but rides on the wave of NoSQL "coolness". It was created by people who had no experience in databases and are learning as they go[4]. As they started adding things that are essential for database they realized it's not that simple.

Currently MongoDB is outperformed by Postgres. In fact there is an application called ToroDB which provides protcol compatibility for Postgres that emulates MongoDB and even that outperforms Mongo. Mongo also doesn't scale well horizontally, so essentially you don't really don't get any significant advantage.

[1] https://en.wikipedia.org/wiki/Hierarchical_database_model

[2] https://en.wikipedia.org/wiki/Object_database

[3] https://en.wikipedia.org/wiki/XML_database

[4] For example they started with mmap'ed memory regions to store the data. Did not initially use fsync() to make sure data is saved on disk. In a way it reminds me of MySQL several years ago. It's much better now than it was in the past, but it has a lot of warts left of from the past.

4 comments

MongoDB is really, really bad. I've never come across another product that was so horrible and yet so widely used.

MongoDB: for when you don't need consistency, availability, or partition tolerance.

There are some really good NoSQL products out there. I seriously think RethinkDB is on par with Postgres. I've also used Cassandra and BerkeleyDB and they're both decent. But unless some core part of your business logic is pathological to implement in SQL (like Reddit's comment trees) you should go with Postgres.

No no no, you don't understand at all. Mongo DB is webscale. /slashdot-comment
Right, absolutely true. Mongo is just such a bad piece of software.
> so far nothing is better than relational model

my favorite quote: "the relational model is rooted depth in the set theory and math is hard to fool"

it was on a post on how all NoSQL are ending up having to reinvent having, group and join clauses on their API because that's what apps do with data

A recent candidate said basically this during his interview, he got the job.
Neither HAVING nor GROUP BY are part of the relational model, though. Really, aggregation isn't part of the traditional relational model. People just tend to associate it with RDBMSes.
Currently MongoDB is outperformed by Postgres

I hate hearing absolutist dogma like this. Some things are faster in Postgres, some things are faster in Mongo. We are migrating our analytics db from Mongo to PG but we hit a wall because SELECT COUNT(DISTINCT x)) performs abysmally in Postgres. Mongo's aggregation framework is an immature PITA but it performs this little trick well enough that we're pretty much stuck keeping Mongo around unless we want to use MSSQL or Oracle or something else with a better (and much more expensive) query planner.

We still love (and prefer) Postgres but it is not a pareto improvement. There are always tradeoffs, and this kind of fanboyism just speaks to inexperience.

There's no reason for SELECT COUNT(DISTINCT x)) to perform badly in Postgres, as long as you have an appropriate table design and indexes.

Also MSSQL's query planner isn't better than Postgres', I work with both. Postgres does have its quirks though, especially with the MVCC row expiry.

> There's no reason for SELECT COUNT(DISTINCT x)) to perform badly in Postgres, as long as you have an appropriate table design and indexes.

Meh. Postgres' planner doesn't know how to generate a skip-scan/loose index scan for DISTINCT. You can write it yourself, but it's a bit painful: https://wiki.postgresql.org/wiki/Loose_indexscan

If you have a low cardinality that can be a huge efficiency difference.

Actually that's wrong:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                          QUERY PLAN                                                           
    -------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=10713.04..11381.06 rows=10 width=5) (actual time=1010.383..1073.263 rows=11 loops=1)
       Group Key: calc
       ->  Sort  (cost=10713.04..10935.68 rows=89056 width=5) (actual time=1010.321..1049.189 rows=89041 loops=1)
             Sort Key: calc
             Sort Method: external merge  Disk: 1392kB
             ->  Seq Scan on price_history  (cost=0.00..3391.56 rows=89056 width=5) (actual time=0.007..20.516 rows=89041 loops=1)
     Planning time: 0.074 ms
     Execution time: 1076.521 ms
    (8 rows)

With Index:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                                         QUERY PLAN                                                                          
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=0.29..2804.82 rows=10 width=5) (actual time=0.117..47.381 rows=11 loops=1)
       Group Key: calc
       ->  Index Only Scan using price_history_calc_idx on price_history  (cost=0.29..2359.52 rows=89041 width=5) (actual time=0.054..18.579 rows=89041 loops=1)
             Heap Fetches: 83
     Planning time: 0.208 ms
     Execution time: 47.416 ms
    (6 rows)

Actually that is called a index only scan, and happens when you have a data type that is inside your index. Which means if you need a aggregate you could try to index everything you need. Mostly a aggregate only contains some values of a row so a index is mostly not a problem.
I didn't say an index couldn't be used at all. Just not to actually make the query fast. This will get all duplicates for a value from the index, before going to the next value. If you have a couple thousand or more of each to be counted value that'll make the query rather slow.
An index only scan isn't the same as a loose index scan. They're orthogonal tricks.
You would need to add support for skip scans to the query planner, and adding feature to the query planner is rarely simple.
We are migrating our analytics db from Mongo to PG but we hit a wall because SELECT COUNT(DISTINCT x)) performs abysmally in Postgres.

An alternative is PipelineDB, which is built on Postgres (and drop-in compatible, supposedly) and should provide an efficient implementation for those kinds of queries: https://www.pipelinedb.com/

I've never used it, though.

"We are migrating our analytics db from Mongo to PG but we hit a wall because SELECT COUNT(DISTINCT x)) performs abysmally in Postgres."

https://github.com/aggregateknowledge/postgresql-hll ?

or even

https://www.periscopedata.com/blog/hyperloglog-in-pure-sql.h... ?

Try:

    SELECT COUNT(*) FROM (SELECT DISTINCT x FROM table) AS temp;

The

    SELECT COUNT(DISTINCT x))
Is slow because it performs sort.
> It was created by people who had no experience in databases and are learning as they go[4]. As they started adding things that are essential for database they realized it's not that simple.

This sounds exactly like what MySQL development looked like to me early in its rise to popularity.