by axefrog 4515 days ago
What limitations should one be aware of that would make ElasticSearch not a viable candidate where something like MongoDB would be a better fit?
7 comments

When running a search, ES by default will not show items that have been indexed in the last 1 second. Directly getting an item by its ID doesn't have that limit though, and you can optionally force a refresh so that a search shows all items.

Other than that (which is just performance tuning, really), ES matches mongodb feature for feature, and obviously has a lot of extra power from its search heritage such as facets and percolate.

So I can't actually think of any limitations, and it's why I said ES makes a better MongoDB than MongoDB.
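On the 1-second visibility window mentioned above: a minimal sketch of the requests involved, assuming a node at localhost:9200 and a hypothetical index called `items`. It only builds the requests and never sends them. The `_refresh` endpoint and the `index.refresh_interval` setting are real Elasticsearch names; the host, index name and helper function names are made up for illustration.

```python
import json
from urllib.request import Request

BASE = "http://localhost:9200"  # assumed local node
INDEX = "items"                 # hypothetical index name


def refresh_request(index):
    """Explicit refresh: makes recently indexed docs visible to search."""
    return Request(f"{BASE}/{index}/_refresh", method="POST")


def refresh_interval_request(index, interval):
    """Change how often the index refreshes on its own (default is 1s)."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode()
    return Request(
        f"{BASE}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = refresh_request(INDEX)
print(req.get_method(), req.full_url)
```

Refreshing on every write costs indexing throughput, which is presumably why this is a tuning knob rather than the default.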

On ElasticSearch you have to update the whole document; there are no commands to manipulate it in place. You don't have commands like $set, $addToSet, $pop, etc.

You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data. I have difficulties matching documents with the exact title being searched for. On MongoDB that just works, on ElasticSearch you need to configure it.

ElasticSearch has some advantages and MongoDB others. I think they are great together. One for storage and the other for searching.
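To illustrate the analyzer point, here is a toy sketch in plain Python (not Elasticsearch code) of why an analyzed field does not give exact-title matching out of the box. `toy_analyze` is a made-up stand-in that lowercases and splits on non-alphanumerics, which only roughly resembles what a default analyzer does; the titles are invented.

```python
import re


def toy_analyze(text):
    """Made-up stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def analyzed_match(query, title):
    """Analyzed-field behaviour (simplified): any shared token is a hit."""
    return bool(set(toy_analyze(query)) & set(toy_analyze(title)))


def exact_match(query, title):
    """Unanalyzed-field behaviour: the whole string must be identical."""
    return query == title


titles = ["The Hobbit", "The Hobbit: An Unexpected Journey", "The Shining"]
query = "The Hobbit"

# Every title shares at least the token "the" with the query.
print([t for t in titles if analyzed_match(query, t)])
# Only the identical string matches.
print([t for t in titles if exact_match(query, t)])
```

In Elasticsearch the usual way out is to also map the title as an unanalyzed field (a `not_analyzed` string in older versions, `keyword` in newer ones) and query that field for exact matches.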

Regarding updates, you can use the Update API for partial updates, and include a script to do things like "counter += 1" or "add value to existing array".

Internally it is still reindexing the entire document, but from your application's perspective, the Update API is a lot friendlier.

http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
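A rough sketch of what those Update API bodies can look like, written as Python dicts that would be sent to a document's `_update` endpoint. The field names (`title`, `counter`, `tags`) are hypothetical, and the script layout shown (`source` plus `params`) is the newer syntax; older releases took the script as a plain string, so check the docs for your version.

```python
import json

# Partial update: only the listed fields are merged into the stored document.
partial_update = {"doc": {"title": "New title"}}

# Scripted update, "counter += 1" style. The script object layout below
# (source/params) is the newer syntax; older releases took a plain string.
increment_counter = {
    "script": {
        "source": "ctx._source.counter += params.count",
        "params": {"count": 1},
    }
}

# Scripted update, "add value to existing array" style.
append_tag = {
    "script": {
        "source": "ctx._source.tags.add(params.tag)",
        "params": {"tag": "search"},
    }
}

for body in (partial_update, increment_counter, append_tag):
    print(json.dumps(body))
```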

Thanks for pointing that out, it will be really useful!
>You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data.

This is really important. Creating a proper search experience with auto-complete which works "just like you want" can be a very painful experience with ES, especially if you are new to ES. It bit me some time ago when I was trying to achieve just that.

Care to elaborate? What were the steps you had to go through?
For storage of data, I'd use an RDBMS like Postgres and nothing else. Not Mongo.
I can't comment much on MongoDB, but I've written a bit about things to keep in mind when considering Elasticsearch as a NoSQL store here: https://www.found.no/foundation/elasticsearch-as-nosql/
An interesting read, but I'd disagree with your contention that NoSQL isn't about ACID. When NoSQL databases started coming out, it was really about which CAP guarantee a database chooses to compromise. Traditional SQL databases are either partition-intolerant or become unavailable (for writes) in the event of a partition. NoSQL databases compromise on consistency. If a database is claiming to be NoSQL and have ACID transactions, they've either disproven CAP or aren't part of the new group of distributed, partition-tolerant databases that people have been calling NoSQL. It's been said for a while that NoSQL is a terrible name for that group of technologies and now that we're getting databases with a non-SQL interface but also having consistency guarantees, the name is starting to cause even more confusion.

Side note: Happy Found customer here...you guys have made it much easier to run our ES index!

Thanks for the feedback!

The point of that section is exactly that "NoSQL" (or, to make things even more confusing, "NOSQL" as in "Not only SQL") doesn't have a very specific meaning. Some think it rules out ACID, others don't. Thus, you'll need to know what you need.

And database marketing tends not to be very good at pointing out what a product is not good at, or at actually delivering what it promises. See also: http://aphyr.com/tags/jepsen

I'm not sure you have this right. CAP says nothing about ACID - it only mentions consistency.

NoSQL was in large part about precisely what the name implies - giving up relational (SQL) data in exchange for better performance and the ability to have a distributed store. Yes, part of this is also about being willing to trade off consistency for availability. But Elasticsearch is an example of a NoSQL store which does focus on consistency (in this case at the expense of availability and, to some extent, partition tolerance).

I'm not sure if ElasticSearch does anything like this, but I make use of MongoDB's GeoJSON queries, namely the $geoIntersects operator.

http://docs.mongodb.org/manual/applications/geospatial-index...
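For anyone who hasn't seen it, a minimal sketch of what a `$geoIntersects` filter looks like, written as a Python dict. The field name `location` and the coordinates are made up; `$geoIntersects` and `$geometry` are the real operator names.

```python
# Hypothetical document field holding GeoJSON geometry.
FIELD = "location"

# A closed GeoJSON polygon ring: first and last positions are identical,
# and each position is [longitude, latitude].
area = {
    "type": "Polygon",
    "coordinates": [[
        [-74.0, 40.7],
        [-73.9, 40.7],
        [-73.9, 40.8],
        [-74.0, 40.8],
        [-74.0, 40.7],
    ]],
}

# Filter document: matches documents whose geometry intersects `area`.
query = {FIELD: {"$geoIntersects": {"$geometry": area}}}

# With pymongo this would be passed as: collection.find(query)
print(query)
```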

In addition to the various geo filters/queries, there are also two aggregations for geo related stuff:

Geohash Grid: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

Geodistance: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

Might not matter, but they do not follow the geojson spec for spatial storage.
Sure, ES supports lat/lon as properties, strings, geohash and geojson:

http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

I could be totally wrong, but the docs you linked to do not actually conform to the geojson spec. It is geographic and it is json, but not valid geojson. The part where it says:

> Format in [lon, lat], note, the order of lon/lat here in order to conform with GeoJSON.

.. the data example below is not actually geojson. See the spec:

http://geojson.org/geojson-spec.html

I think the documentation is not clear here. ElasticSearch has an internal GeoPoint type, which can be read from any kind of JSON document. One of the possible notations is the GeoJSON coordinate notation.

Elasticsearch can map any kind of JSON, so you can, without problems, write a mapping for proper GeoJSON points (map "type" as an unanalyzed string and "coordinates" as a GeoPoint). Arrays of values are generally supported in ES.
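A sketch of such a mapping, as a Python dict, assuming a hypothetical top-level field named `location` that holds a GeoJSON Point. `keyword` is the newer name for an unanalyzed string (older releases used `string` with `"index": "not_analyzed"`), so adjust for your version.

```python
import json

# Hypothetical field name "location"; "keyword" is the newer name for an
# unanalyzed string (older releases: "string" with "index": "not_analyzed").
mapping = {
    "properties": {
        "location": {
            "properties": {
                "type": {"type": "keyword"},
                "coordinates": {"type": "geo_point"},
            }
        }
    }
}

# A GeoJSON Point document that fits the mapping above: [lon, lat] order.
doc = {"location": {"type": "Point", "coordinates": [13.4, 52.5]}}

print(json.dumps(mapping, indent=2))
```

This only covers points; other GeoJSON geometry types would not fit a GeoPoint field.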

The biggest problem is that Elasticsearch probably does not provide all kinds of queries you'd like if you are working with complex shapes. Basically, only distance and simple location queries with polygons are supported.

When I played around with it, I could not figure out a way to get the exact count of events in the datastore when the data was distributed in replicas. In fact, there was a ticket open for this, but I'm not able to fish it out now.
presharding

You have to pick the number of shards for each index (database) up front, and you can't expand it later.
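A toy illustration of why the shard count is fixed: documents are placed by hashing a routing value (the document ID by default) modulo the number of primary shards, so changing that number would send most existing documents to a different shard. The sketch below uses `zlib.crc32` purely as a deterministic stand-in; it is not the hash Elasticsearch uses, and the document IDs are invented.

```python
import zlib


def shard_for(routing, num_shards):
    """Toy routing: hash the routing value, take it modulo the shard count.

    crc32 is a stand-in chosen for determinism; it is NOT the hash
    Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Count documents that would land on a different shard if the index
# went from 5 to 6 shards without reindexing.
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents would change shard")
```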

Is this still a limitation? I haven't run into any use cases where this has been a problem yet. Since the default shards are 10 and 2 replicas, does that not mean each index should be able to scale up to 20 servers? I'd think that if your data grew enough that 1/10th does not fit on a server, you could do a one time maintenance and rebuild all your servers.

I have my doubts mongodb would scale up that well to 20+ servers without some maintenance as well. So I'm not sure how that's really a limitation anyone should use for choosing mongodb or ES. If you're expecting that kind of data, just make a large number of shards in your index creation as it will work fine on fewer servers too?

you can grow a little larger than that by using some nodes only for aggregating/handling queries (holding no data/shards)

larger number of shards=slower searching (unless you distribute the shards to multiple nodes)

What I've done, and I'm not totally sure if it's a best practice, is over-allocate the number of shards. So if I think I need 5 shards, I create 50 or 100 shards instead. Then I have some app logic to determine the shard a document should go to. Initially all docs go to shard 0. When that's full (around 15 GB in size, depending on your RAM), I set all docs to go to shard 1. Of course, you'll need to be careful, as you don't want duplicate documents in different shards.

The benefit of this is that as your app scales, you search only the shards needed. So if you have just 1 shard with data, you can tell ElasticSearch to search just that 1 shard.

So, what happens when you fill up the last shard?
look: routing_field
also changing indexed-fields on the go