Hacker News new | ask | show | jobs
by HorizonXP 4569 days ago
Serious question: what are NoSQL databases really good for? I'm only really used to relational DBs, and I'm unclear about which problems a NoSQL database is useful for.
8 comments

Take Redis for example: simple KV database with in-memory perf and a very comfortable API with option to flush data to disk periodically. Addresses a lot of interesting scenarios such as session storage, request throttling by a certain key etc. Not used as a replacement for a RDBMS most of the time, but rather as a specialized tool for a certain use case. At some point your RDBMS is already hammered hard enough, you don't need to dump everything into it, but ultimately you could, especially at first. Comes with clustering for free as long as you're aware of the risks and use it appropriately.

This is not to say you couldn't turn Postgres into something similar. Put the data on a ramfs, relax various write guarantees (lots of knobs in PG) etc.

which problems a NoSQL database is useful for

It depends in which NoSQL database. Depending on your problem, you can probably find a NoSQL DB optimised for it. It will often be unclear if that NoSQL DB is actually better than a relational database until you try it.

Examples:

High write throughput: Cassandra

Simple key-value: Redis

Text search: Solr/Elastic Search

etc..

Redis does more than simple key-value - rather than just reading and writing values, since values can be complex types like lists and dictionaries, it can insert into them, append to them, etc. It's still key-value, it's just not simple key-value!

That said, i would hesitate to describe Redis as a database at all. A key characteristic of databases is that they store every write in a durable way. Redis can checkpoint its state periodically, but as i understand it, it either can't or typically isn't used to safely keep every write. Redis is something in between a database and memcached. I doubt there's ever a situation where you have to choose between PostgreSQL/Cassandra/CouchDB and Redis; Redis is something you would use in addition to a database.

As for text search - RDBMSs have full text search, and at least in the case of PostregSQL, it seems pretty good - see slide 49 in http://es.slideshare.net/billkarwin/full-text-search-in-post... from 2009. You might not want to be leaning on your database for text search when you're at scale (for operational reasons more than performance ones), but it's a plausible way to start.

Somewhat similar to the difference between C/C++ or enterprise Java and various dynamic or "scripting" languages in terms of structure, the "S" in SQL or NoSQL. The former tends to require more up-front structural design to do anything, which is a nuisance at first when you're not sure what you will end up needing. The latter tends to be quicker to throw together and repeatedly rearrange in the early days of its evolution, which is a huge advantage to a small startup trying to figure out what business it's in, but often runs into trouble later, once the needs have stabilized and the structure becomes more of a benefit than a cost.

For areas other than structure, the difference can be more like the difference between a general purpose programming language and a domain-specific language. SQL databases are general purpose, so they offer a large, nicely integrated package of features, while NoSQL databases often sacrifice some of those features to optimize for some niche. From this perspective, the SQL database can be the best choice for a startup trying to find its niche with the NoSQL chosen as an optimization once that niche has been identified.

The fact that a new business can't be sure what its business will end up being is an argument both for and against using a SQL database.

I have been using redis as a write though cache(not just a read cache). So I used to have 1000 concurrent writes to my Postgres db. Obviously this had a negative impact on performance even with modern specs(64gb ram, 8 cores). I realized that a lot of those writes we're updating the same column for the same row but in different threads. So I basically use redis to buffer these queries and perform 1 single write instead of multiple writes for the same row.
What's the algorithm you use for consolidating those repeated writes? I've thought about doing something like this in the past, but was put off by the worry that there might be edge cases where data would be lost (e.g. insufficient writes to trigger a write through, many concurrent writes, etc.)
So I actually store the rows as serialized JSON objects in my redis cache, and when a write occurs, I update the JSON representation in the cache (if it's not in the cache, I read it from the DB and store it there). After X hours, I have a daemon that goes through all expired keys, deserializes the JSON, and executes a transaction, updating 500 objects at a time. To prevent possible race conditions, you need a mutex to allow only 1 thread to modify the cache for that row.

Each object has a field that keeps track of what fields were updated, so I can construct an update query from that.

When I read a row from the DB, I first check if it's in the write through cache, if so I retrieve it there.

Of course, this means that all writes go through my cache. There can't be any other process that updates that table that avoids writing to the cache, or else we have database inconsistency.

This improved the load in my DB server by more than ten-fold, as well as the indexing time to my ElasticSearch server.

Sounds pretty effective to me. Thanks for sharing this.

Possibly stupid question - how do you retrieve all the expired keys? I was under the impression that once keys expired, they were effectively invisible to clients?

So I actually set a soft time to expire as a value for the key. My daemon reads all objects < this time, and deletes those keys itself. The hard time to expire is set to a time much further than this to prevent the scenario of the daemon exiting for some reason, and memory overflowing.
When you remove non-local constrains (like foreign keys) you gain scalability, that enables companies like Google to query their data better.

Now, if you are doing anything more mundane (like, for example, logging every bank operation in a country) that's overkill, and non-local constrains are a great thing to have. Thus, I'd advise not using it unless you have a clear necessity.

And always keep in mind that changing yor data is hard, whatever technology you are usig. Relational databases making the change itself hard, while schemaless databases make maintaining the software that reads it hard. Schemaless does give you a small upfront benefit, for a huge cost later.

Via analogy:

RDBMS = swiss army chainsaw, can do anything although maybe painfully and slowly.

NoSQL = 5/16th inch torque pre-set wrench for tightening sma connectors to 10 inch-lbs torque. Can't do much else, but its near perfect at its tiny little area of expertise.

On a meta level its a battle between hardware and applications, where app demands have (momentarily, of course) fallen behind commodity hardware ability.

For example, XML databases are handy when you have large XML datasets that you like to query. The database indexes the XML, allowing you to execute most XPath queries and XQuery programs quickly.
It's not about indexing paths you indicate, XML databases are about indexing the whole structure (remember, XML is not relational, but are graphs). Besides that they provide XPath/XQuery processing and optimization. You can query large XML documents or sets of documents, like you'd query them with e.g. XQilla.

Of course, it is possible to implement all of this on top of existing database technology. E.g. Oracle's Berkeley DB XML is implemented on top of Berkeley DB. But, a relational database with some indexing of XML does not provide the same functionality as an XML database.

> It's not about indexing paths you indicate, XML databases are about indexing the whole structure

From my first link, MSSQL does both:

    The primary XML index indexes all tags, values, and paths within the XML instances in an XML column
    ...
    To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes
And it supports some reasonable subset of XQuery [1].

[1] http://technet.microsoft.com/en-us/library/ms187854.aspx

Oracle for one does allow full document xml indexing (via XMLIndex) which is trivial to implement, also a rather full featured XQuery integrated with SQL.
Map-reduce style scalability. SQL databases are not good for that but can be used for that as well. However then your DB will be not normalized most probably.

But again different NoSQL DBs have different purpose.