Elasticsearch is really awesome for searching, but what most people don't realize is that it makes a better MongoDB than MongoDB while giving you that searching too.
The one drawback ES had in the bad old days was that backup and restore was a nightmare... ESPECIALLY on AWS. The new system they introduced was so simple I was concerned about updating to it because I was SURE something would go south.
But it all just worked.
I still have the Couch to ES replication running because I'm anal like that... but really... yeah... you can do without Couchbase, Mongo et al... ES will probably do everything you need PLUS everything you can't do in the others.
As a proud user of Elasticsearch since the early days, I'm happy to see so much progress. Never mind the "search" part of the name; it's really a database for all practical purposes, especially for web data.
to be fair, the main selling point of mongodb is that developers can access it more easily. i haven't really touched mongodb in over a year, and then only for playing, but have you tried the elasticsearch filter query syntax? have you compared it with mongodb's syntax?
also, i have the exact opposite nitpick. people want to use it to do everything, mail indexers, file system indexers. what's the matter with web developer folks? why is it that when the next database comes around they want to use it for everything?
"....why is it that when the next database comes around they want to use it for everything?...."
Because they like a simple web stack. KISS means a faster time to market. Faster time to iterate. Faster time to fix bugs because there are fewer places those bugs can be. All of that doesn't even factor in the productivity benefits gained by not having to switch technologies from project to project.
But to be fair, ES is not some brand new database... ES has been around for a LONG time.
Just curious: if I'm using, say, version 0.92, how would I go about backing up my Elasticsearch instance, besides creating a replica on another server and then "freezing" it by disconnecting that server?
This is technically a very naive approach, since a simple rsync of the data dirs will include replicas too. If you were more diligent you could check the state files in each shard directory and only copy out the primaries.
You can just google "elasticsearch rsync" to get information, and even scripts, that will do this for you. The thing is... you REALLY need to know what you're doing when you go this route.
Also, you can try the gateway feature. Gateway is actually pretty straightforward. Restore WILL be slow though. And for many scenarios ... it is not ideal. (You don't want to take a day, or even a few, to restore after a failure.)
I think the best advice is...
Update to 1.0.
Just go to 1.0 and do snapshots... you will save yourself A LOT of headaches.
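For anyone wondering what "do snapshots" looks like in practice, here is a rough sketch of the three REST calls involved, written as plain Python that only builds the requests (nothing is sent). The repository name, snapshot name, and filesystem path are made up for illustration, and I'm assuming a shared-filesystem ("fs") repository; check the snapshot docs for your version before relying on this.

```python
import json


def register_repo(repo, location):
    """Build the request that registers a shared-filesystem snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", json.dumps(body))


def create_snapshot(repo, snapshot):
    """Build the request that snapshots the cluster and waits for it to finish."""
    return ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", "")


def restore_snapshot(repo, snapshot):
    """Build the request that restores a previously taken snapshot."""
    return ("POST", f"/_snapshot/{repo}/{snapshot}/_restore", "")


if __name__ == "__main__":
    # Hypothetical names: "my_backup", "snapshot_1", "/mnt/es_backups".
    steps = [
        register_repo("my_backup", "/mnt/es_backups"),
        create_snapshot("my_backup", "snapshot_1"),
        restore_snapshot("my_backup", "snapshot_1"),
    ]
    for method, path, body in steps:
        print(method, path, body)
```

You would send each request to any node in the cluster with curl or an HTTP client. With an fs repository, the location has to be a path every node can reach.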
I'm surprised so many people miss this. Out of the box, Elasticsearch is a distributed NoSQL store with better write consistency (and arguably performance) than MongoDB offers in its default configuration. The major missing feature was backup snapshots and restores, which 1.0 delivers, along with aggregations that more than rival MongoDB's. The team has intentionally avoided marketing themselves as a NoSQL store (I was told this directly by an employee), but they're aware of the potential and have customers using it as such.
It's easy to miss. On the front page, the word "store" only occurs once, buried three page-scrolls down in the body text. Otherwise it very much gives the impression of being some kind of analytics dashboard for third-party datastores. And I didn't notice even that until after I had visited the website, clicked through a few links trying to figure out what the fuss was about, then given up and decided to read the comments here.
Probably because some store features have been missing up to 1.0, like backup/restore without knowing database internals. (yes, rsync did the job, but only because you knew the list of guarantees that makes it possible).
Also, Lucene at its core is an Index. Changing the query strategy might require reindexing. It is perfectly valid to throw data at it, build the index and throw away the source. You will just never get it back again.
While ES can be used and tuned as a store just fine, it is not necessarily its raison d'etre.
While I agree with the sentiment, I think Shay (lead ES developer) has explicitly said that he does not consider ES to be a data store... yet. I think this is mostly due to maturity.
I help run a large ES cluster (with canonical data in MySQL), and I consider this cautious attitude by the ES developers to be a good thing.
No. The choice of datastore is still incredibly complicated in the distributed world; it's all about tradeoffs really.
For example, Elasticsearch has poor availability characteristics - both because it is master-slave and because it focuses on ensuring consistency - relative to, for example, something like Riak.
I don't believe it's "master-slave" in the way you're thinking. Elasticsearch shards its indexes among all available nodes, storing replicas of each shard on separate nodes when possible. This ensures that the entire cluster is available as long as at least one replica of a shard is still online. In fact, if configured properly, it has better availability than consistency since by default it only flushes its oplog to the Lucene index segments every second (though writes aren't considered committed until they reach a quorum of nodes, so consistency is fairly good in practice as well).
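To put a number on "a quorum of nodes": as far as I remember the docs from that era, the write quorum is computed over the shard copies (the primary plus its replicas), and it is only enforced when an index has more than one replica. A minimal sketch of that arithmetic, with a function name of my own choosing:

```python
def write_quorum(number_of_replicas):
    """Shard copies that must be active for a write to proceed.

    Assumption (from memory of the 1.x docs): quorum is
    int((primary + number_of_replicas) / 2) + 1, and it is only
    enforced when number_of_replicas is greater than 1.
    """
    if number_of_replicas <= 1:
        # With zero or one replica, only the primary is required.
        return 1
    copies = 1 + number_of_replicas  # the primary plus its replicas
    return copies // 2 + 1


if __name__ == "__main__":
    for replicas in range(5):
        print(replicas, "replicas ->", write_quorum(replicas), "active copies needed")
```

So with 2 replicas you need 2 of the 3 copies up before a write is accepted, which is where the consistency-versus-availability tradeoff shows up.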
It is definitely a nice, and flexible option.. it truly depends on what your needs are... If you're often updating parts of a document, MongoDB or RethinkDB may be better options. If you want integration where a lot of parts are SQL with some document ability, PostgreSQL + V8 is pretty compelling. Also, something like Cassandra may suit your needs better if you want a better and more predictable growth curve.
There's no holy grail of data storage... ElasticSearch is really nice, and if it fits your needs, more power to you.
Well, maybe some day, but it is still too easy to corrupt the data or index. Recently I had a problem where the data itself was fine and searches worked correctly, but everything was 100x slower than it should have been. It just started happening for no apparent reason, and I only do basic searches on typical data. I still don't know what happened, but creating a new index fixed the problem.
I had a live production logistics system running on top of Elasticsearch 0.6 (as a NoSQL database) back in 2012. This powered one of India's largest ecommerce systems (at that time).
Elasticsearch is brilliant as a NoSQL store, and if you are already using Elasticsearch as a search system, you don't need to introduce yet another component into your stack.
When running a search, ES by default will not show items that were indexed within the last second. Getting an item directly by its ID doesn't have that limit, though, and you can optionally force a refresh so that a search shows all items.
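Here is roughly which requests are involved in that one-second behaviour, again as Python that only builds the method and path. The index, type, and ID values are placeholders, and the URL layout assumes the 1.x-style /index/type/id scheme.

```python
import json


def index_doc(index, doc_type, doc_id, refresh=False):
    """Index a document; refresh=True makes it searchable immediately."""
    path = f"/{index}/{doc_type}/{doc_id}"
    if refresh:
        path += "?refresh=true"
    return ("PUT", path)


def get_doc(index, doc_type, doc_id):
    """Get by ID: real-time, not subject to the refresh interval."""
    return ("GET", f"/{index}/{doc_type}/{doc_id}")


def refresh_index(index):
    """Explicitly refresh an index so recent writes show up in searches."""
    return ("POST", f"/{index}/_refresh")


def set_refresh_interval(index, interval):
    """Change how often the index refreshes (the default is 1s)."""
    body = {"index": {"refresh_interval": interval}}
    return ("PUT", f"/{index}/_settings", json.dumps(body))


if __name__ == "__main__":
    # Placeholder names: "shop", "order", "42".
    print(index_doc("shop", "order", "42", refresh=True))
    print(get_doc("shop", "order", "42"))
    print(refresh_index("shop"))
    print(set_refresh_interval("shop", "5s"))
```

Refreshing on every write is expensive, so in practice you would use it sparingly or just tune the interval.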
Other than that (which is just performance tuning, really), ES matches mongodb feature for feature, and obviously has a lot of extra power from its search heritage such as facets and percolate.
So I can't actually think of any limitations, and it's why I said ES makes a better MongoDB than MongoDB.
On ElasticSearch you have to update the whole document, no commands to manipulate them. You don't have commands like: $set, $addToSet, $pop, etc..
You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data. I have difficulties matching documents with the exact title being searched for. On MongoDB that just works, on ElasticSearch you need to configure it.
ElasticSearch has some advantages and MongoDB others. I think they are great together. One for storage and the other for searching.
Regarding updates, you can use the Update API for partial updates, and include a script to do things like "counter += 1" or "add value to existing array".
Internally it is still reindexing the entire document, but from your application's perspective, the Update API is a lot friendlier.
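A sketch of what those two kinds of Update API request bodies look like, using the 1.0-era script syntax. The index, type, and field names are invented for the example, and scripting syntax and defaults have changed between versions, so treat this as an illustration rather than something to paste in.

```python
import json


def partial_update(index, doc_type, doc_id, fields):
    """Merge the given fields into the existing document."""
    body = {"doc": fields}
    return ("POST", f"/{index}/{doc_type}/{doc_id}/_update", json.dumps(body))


def scripted_increment(index, doc_type, doc_id, field, amount=1):
    """Increment a numeric field via a script (the 'counter += 1' case)."""
    body = {
        "script": f"ctx._source.{field} += count",
        "params": {"count": amount},
    }
    return ("POST", f"/{index}/{doc_type}/{doc_id}/_update", json.dumps(body))


if __name__ == "__main__":
    # Placeholder names: "blog", "post", "1", "views".
    print(partial_update("blog", "post", "1", {"title": "New title"}))
    print(scripted_increment("blog", "post", "1", "views"))
```

It covers much of what $set and $inc do in Mongo, even though the whole document is rewritten under the hood.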
>You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data.
This is really important. Creating a proper search experience with auto-complete that works "just like you want" can be very painful with ES, especially if you are new to it. It bit me some time ago when I was trying to achieve just that.
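On the exact-title problem specifically, the usual trick is to index the title twice: once analyzed for full-text search and once untouched for exact matching. A minimal sketch of that mapping and query in 1.x terms; the "raw" sub-field name is just a convention I'm assuming, and later versions use a different field type for this.

```python
import json


def title_mapping(doc_type):
    """Mapping with an analyzed 'title' plus a not_analyzed 'title.raw' sub-field."""
    return {
        doc_type: {
            "properties": {
                "title": {
                    "type": "string",
                    "fields": {
                        # Stored verbatim, so it only matches the exact string.
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                }
            }
        }
    }


def exact_title_query(title):
    """Term query against the untouched sub-field for an exact title match."""
    return {"query": {"term": {"title.raw": title}}}


if __name__ == "__main__":
    print(json.dumps(title_mapping("book"), indent=2))
    print(json.dumps(exact_title_query("The Art of War"), indent=2))
```

Full-text queries still go against "title"; only the exact lookup uses "title.raw".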
An interesting read, but I'd disagree with your contention that NoSQL isn't about ACID. When NoSQL databases started coming out, it was really about which CAP guarantee a database chooses to compromise. Traditional SQL databases are either partition-intolerant or become unavailable (for writes) in the event of a partition. NoSQL databases compromise on consistency. If a database is claiming to be NoSQL and have ACID transactions, they've either disproven CAP or aren't part of the new group of distributed, partition-tolerant databases that people have been calling NoSQL. It's been said for a while that NoSQL is a terrible name for that group of technologies and now that we're getting databases with a non-SQL interface but also having consistency guarantees, the name is starting to cause even more confusion.
Side note: Happy Found customer here...you guys have made it much easier to run our ES index!
The point of that section is exactly that "NoSQL" (or, to make things even more confusing, "NOSQL" for "Not only SQL") doesn't have a very specific meaning. Some think it rules out ACID, others don't. Thus, you'll need to know what you need.
And database marketing tends not to be very good at pointing out what a product is bad at, or whether it actually delivers what it promises. See also: http://aphyr.com/tags/jepsen
I'm not sure you have this right. CAP says nothing about ACID - it only mentions consistency.
NoSQL was in large part about precisely what the name implies: giving up relational (SQL) data in exchange for better performance and the ability to have a distributed store. Yes, part of this is also about being willing to trade off consistency for availability. But Elasticsearch is an example of a NoSQL store which does focus on consistency (in this case at the expense of availability and, to some extent, partition tolerance).
I could be totally wrong, but the docs you linked to do not actually conform to the geojson spec. It is geographic and it is json, but not valid geojson. The part where it says:
> Format in [lon, lat], note, the order of lon/lat here in order to conform with GeoJSON.
.. the data example below is not actually geojson. See the spec:
When I played around with it, I could not figure out a way to get the exact count of events in the datastore when the data was distributed across replicas. In fact, there was a ticket open for this, though I'm not able to fish it out now.
Is this still a limitation? I haven't run into any use cases where this has been a problem yet. Since the default shards are 10 and 2 replicas, does that not mean each index should be able to scale up to 20 servers? I'd think that if your data grew enough that 1/10th does not fit on a server, you could do a one time maintenance and rebuild all your servers.
I have my doubts mongodb would scale up that well to 20+ servers without some maintenance as well. So I'm not sure how that's really a limitation anyone should use for choosing mongodb or ES. If you're expecting that kind of data, just make a large number of shards in your index creation as it will work fine on fewer servers too?
What I've done, and I'm not totally sure it's a best practice, is over-allocate the number of shards. So if I think I need 5 shards, I create 50 or 100 instead. Then I have some app logic to determine which shard a document should go to. Initially all docs go to shard 0. When that's full (around 15 GB, depending on your RAM), I set all docs to go to shard 1. Of course, you'll need to be careful, as you don't want duplicate documents in different shards.
The benefit of this is that as your app scales, you search only the shards needed. So if you have just 1 shard with data, you can tell Elasticsearch to search only that shard.
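A rough sketch of that kind of app logic, using the routing parameter on index and search requests. Everything named here (the bucket naming, the 15 GB cap, the helper functions) is my own invention for illustration. One caveat: routing values are hashed onto shards, so "bucket-0" is not literally shard 0, and two buckets can land on the same shard.

```python
SHARD_CAP_BYTES = 15 * 1024 ** 3  # roughly the 15 GB mentioned above


def active_bucket(bucket_sizes, cap=SHARD_CAP_BYTES):
    """Return the index of the first bucket that is still under the cap."""
    for i, size in enumerate(bucket_sizes):
        if size < cap:
            return i
    raise RuntimeError("all buckets are full; allocate more")


def index_path(index, doc_type, doc_id, bucket):
    """Path for indexing a document into the given routing bucket."""
    return f"/{index}/{doc_type}/{doc_id}?routing=bucket-{bucket}"


def search_path(index, buckets):
    """Path for a search restricted to the shards behind these buckets."""
    routing = ",".join(f"bucket-{b}" for b in buckets)
    return f"/{index}/_search?routing={routing}"


if __name__ == "__main__":
    sizes = [SHARD_CAP_BYTES, 3 * 1024 ** 3, 0]  # bucket 0 is full
    bucket = active_bucket(sizes)
    print(index_path("events", "event", "abc", bucket))
    print(search_path("events", range(bucket + 1)))
```

The app has to remember which bucket each document went to, otherwise an update routed differently creates the duplicate problem mentioned above.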
True. I evaluated Mongo, Couch and a couple of similar solutions, but ES being a search engine from the start really convinced me, that it can be a viable database for loosely structured data.
I don't know much about MongoDB, but it's true that Elasticsearch is a great NoSQL db with support of boolean search. Netflix has a number of use cases that use Elasticsearch as such NoSQL db: http://www.slideshare.net/g9yuayon/elasticsearch-in-netflix
Definitely! We are using it in production for storing monitoring data (via sensu, if anyone is interested). It's fantastic because you can shove data into the index with a ttl of 1 year. And have a x month archival strategy for cold storage.
Its search capabilities and scalability are fantastic - we're throwing GBs of data into it weekly and it just soaks it up.
I would suggest that everyone who is considering one look at both... When I looked into both, about a year and a half ago, I found that geospatial searches worked better in MongoDB at the time, and shaping my data to fit was more awkward with Elasticsearch.
That said, it's definitely worth looking into both, depending on what your needs are.