I've generally found SqlServer's query optimizer to be a hellish nightmare of broken dreams, so if Postgres' is even worse I'm giving up and going back to flat files.
I've found that SQL Server's is generally pretty good, with the big pitfall being the system's timeout for the query planner/compiler. Query compilation timeouts can be really frustrating to work on because, often, the query's complexity is a requirement.
The only other problem is the parameter sniffing problem for stored procedures, although OPTIMIZE FOR UNKNOWN or specified values seem to work fairly well in my experience, though obviously not always.
The real failing is that a common solution to a view query hitting the compiler timeout is to replace it with a stored procedure of some kind. However, if you're not careful you'll run into the parameter sniffing problem with stored procedures! So you run into one caveat and your attempted solution runs into the other one.
I found SQL Server's query optimiser to be magical, but it relies on its table statistics being somewhat correct. Every now and again they liked to suddenly become wrong enough that queries go from magical to catastrophic mush.
(Most recent version I've used was SQL Server 2008.)
All cost based optimisers rely on statistics being correct. And inevitably there will be a case where they aren't. The problem with PostgreSQLs optimiser, and I assume with others too, is that it's too risk happy. It's optimizing for the average case based on the statistics, when people actually tend to care about the worst case.
As an example, say you have a database of all cars ever produced with indexes on model and production date. If you are looking for the latest Ford F-150 then the best plan is to just start looking backwards by date and you will find one soon enough. Much faster than looking up all F-150s and picking the latest one. On the other hand, if you are looking for latest Ford Model-T, that plan is going to be catastrophically terrible, going through 93 years of car production before finding the correct one.
Your example illustrates perfectly how a descriptive query language simply can't work in the general case for something other than ad-hoc workflows where performance isn't a big concern. Why are we all doing this instead of engineering DB access procedures directly? This is one thing I really liked about CouchDB and its map/reduce based access - stop trying to be smart with queries, instead be dead simple to scale horizontally. Granted, it's hard to design the data structures in the design space of normalization (many joins) vs. read performance (few joins) - but at least it's all laid bare to reason about.
I somewhat agree. Not with the sweeping that declarative query languages can't work, for every case where faulty statistics cause a bad plan there are likely to be many cases where a statistic based plan choice correctly switched plans due to data distribution changes.
But I do agree that PostgreSQL has way too little tools to nail down the performance even though they have downsides. Tools like pinning execution plans would be nice, as it has less severe worst case behaviors. As would be the ability to just pass the execution plan directly, although that would have severe cross-version compatibility implications and security will also be hard to nail down after the fact because of all the "can't happen" assumptions sprinkled around in executor code. And even just plain plan hints would be great to have, be it the heavy handed "join in this order", "use this index", or the more graceful "this clause is way less selective than you think", "this clause is functionally dependent on that one" or "assume there is correlation between ordering and predicates".
I'd put it this way: It could be made to work, but IMO all current RDBMS have the same problems with terrible worst case guarantees. Query planners have not enough degrees of freedom - maybe with a new concept like automated indices it could be made to work with acceptable average and worst case guarantees.
Sorry, but hard disagree. For all the failings of SQL systems, they work staggeringly well 99% of the time as long as the operator understands how they work. No-SQL systems do have their place, but all too often their choice stems from a failure to understand SQL-based databases. (And the hard truth: unless you’re doing something especially novel there’s a lot less difference between map/reduce and a bog standard table index than no-SQL proponents would have you believe.)
In my estimation, if you're not already intimately familiar with the intricacies of performant SQL, chances are you're not playing in a space where a no-SQL architecture is an appropriate fit anyway. And you're certainly not in a position to make an informed decision between SQL and no-SQL. It's a much better use of this hypothetical person's time to set up Mariadb or Sqlite and do a deep dive into the fundamentals query performance.
Just to clarify: I'm not saying there is a significant difference between map views and indices, I'm saying I prefer the directness of programming with them over trusting the query planner and table statistics. In couch, to do a table scan, I need to program it as such, otherwise I'm forced to use a view - things like forgotten indices for some edge cases are impossible. Sure takes longer to do things, but when performance issues arise it's easy to reason about while with SQL based DBs I've seen a lot of time being spent at that stage. They work well 99%, and then that remaining 1% takes 99% of your time.
> All cost based optimisers rely on statistics being correct. And inevitably there will be a case where they aren't.
Every single time the SQL Server query optimiser did something obviously wrong, rebuilding statistics fixed it. The problem I had with SQL Server wasn't its reliance of statistics—it's not like they could work any other way—it's that it failed to maintain its statistics correctly. Any time that statistics stop being correct is a bug. It should be able to maintain them itself and trigger rebuilds whenever there's any doubt about them.
The only other problem is the parameter sniffing problem for stored procedures, although OPTIMIZE FOR UNKNOWN or specified values seem to work fairly well in my experience, though obviously not always.
The real failing is that a common solution to a view query hitting the compiler timeout is to replace it with a stored procedure of some kind. However, if you're not careful you'll run into the parameter sniffing problem with stored procedures! So you run into one caveat and your attempted solution runs into the other one.