Hacker News new | ask | show | jobs
by Arkadir 4436 days ago
NoSQL discussions always seem to conflate three very different things: storage engines, APIs and architecture. Where do we store the data ? How do we access it ? How do we make sure it scales ?

The "traditional" approach is to use Oracle/SQL Server/MySQL for storage, SQL and/or ORM as an API, and single-server tables-with-relationships as an architecture. Back in the early 2000s, everybody did this. Sure, there were a few performance-minded exceptions that went with sharding or master-slave architectures instead, but those were exceptions.

And single-server architectures tend to behave badly at medium loads. Spend the market rate for a genius DBA, and they still behave badly at high loads. The next step is a 32-core 128GB RAM monstrosity that costs an order of magnitude more than what eight 4-core 16GB servers would cost.

Most NoSQL solutions came with a new architecture. You had the MongoDB flavor of distributed storage, or the BigTable flavor of distributed storage, or the CouchDB flavor of distributed storage, and so on. Properly implemented distributed storage eats high loads for breakfast: just add more servers. This is a good thing.

My issue with the NoSQL movement is that they threw away the baby with the bath water. They threw away the single-server relational architecture, which was a nice change, and they also gave up the old battle-hardened storage engines and the highly expressive SQL language and replaced them with only-recently-experimental engines and ad hoc lean APIs.

It takes time for a storage engine to mature. To have all its performance kinks ironed out and all its bugs smoked out. I still remember the brouhaha around MongoDB persistence guarantees, or the critical data loss bugs in CouchDB.

And the lean APIs just forced back all the querying logic into the application, with all the filtering and the manual indexing and the joins and the approximate but ultimately incorrect implementations of whatever subset of ACID was required at the time. This wasn't an entirely bad thing: it certainly made many developers aware of the performance implications of some joins or transactions. But when you need to write a JOIN or GROUP BY or BEGIN TRANSACTION that you know will scale properly, and there's no API support for it ? Feh.

I'm a huge fan of the CouchDB architecture. Distributed change streams, with checkpointed views and cached reductions. But I have been burned by the CouchDB storage engine (can you say "data corÊ–NÑ %ñXtion" ?) and I see no point in bending knee to the laconic CouchDB API. So I took the CouchDB architecture and reimplemented it with a PostgreSQL back-end. It's _faster_ (don't underestimate the cost of those HTTP requests), I have trust that after PostgreSQL's decade-long history all threats to my data are long gone, and I can always whip out an SQL query when I do need it.

It's nice to see so many NoSQL solutions migrating back to an SQL-like API and gaining enough maturity to keep your data safe. In the near future, I expect them to be nothing more than "Architecture in a box" solutions for when you don't want to implement specific architectures in SQL. And I expect more and more "architecture plugins" to become available: with a library, turn a herd of SQL databases into a distributed architecture of type X.

4 comments

Single-server relational backends can survive loads well in excess of most of us will ever see for our software. There are a handful of truly web-scale companies that need specialized engineering, and the time to solve for that is when you actually have that problem in sight. Conforming SQL engines can do some amazing things with simple and declarative code, giving that up in the hopes you might be Twitter-scale one day is, in my opinion, a quite poor tradeoff.
I certainly agree that most people who pick NoSQL solutions "for scalability" never add a third server to their cluster.
Amazon offers 1-8 800 GB SSD drives in their i2 instances. RAIDing them together gives you pretty awesome performance and space.
> I took the CouchDB architecture and reimplemented it with a PostgreSQL back-end

can you provide more details on this? is there a public repo?

You can check it here: https://github.com/RunOrg/RunOrg/tree/master/server/cqrsLib I'm sorry for the state of the official site, there will be one soon.

A `Stream` corresponds to change events (I'm not sure these even had an official name in CouchDB).

A `Projection` corresponds to a design document.

The various `View` implementations are optimized for various aspects of CouchDB views, with `MapView` being the literal equivalent of a CouchDB view. Except they can be chained (you can apply a map to a map).

Unlike CouchDB, views are evaluated eagerly, though the `HardStuffCache` (and other planned `Cache` implementations) are evaluated lazily on a per-document basis.

I think you hit the nail on the head. One other thing I'd add was that these new tools forced developers to build tooling around their databases, too. No more ad-hoc query system (like sqldeveloper, or the many fine web tools) where you could explore the data visually or by issuing queries. Since the underlying data formats were entirely domain specific, building these kinds of general tools became much harder.
I think you're understating the capabilities of sql databases. Stackoverflow runs off a single dedicated ms sql server machine: http://meta.stackexchange.com/questions/10369/which-tools-an...
Your link mentions three servers with 384GB RAM and 16 RAID10, and 16 cores each. Not very far from my 128GB 32 core example :)