Elasticsearch 1.0.0 released

Elasticsearch is really awesome for searching, but what most people don't realize is that it makes a better MongoDB than MongoDB while giving you that searching too.

bilbo0s 4515 days ago

This. A THOUSAND TIMES "This".

The one drawback ES had in the bad old days was that backup and restore was a nightmare... ESPECIALLY on AWS. The new system they introduced was so simple I was concerned about updating to it because I was SURE something would go south.

But it all just worked.

I still have the Couch to ES replication running because I'm anal like that... but really... yeah... you can do without Couchbase, Mongo et al... ES will probably do everything you need PLUS everything you can't do in the others.

diminish 4515 days ago

As a proud user of Elasticsearch since the early days, I'm happy to see so much progress. Never mind the *search part of the name; it's really a database for all practical purposes, especially for web data.

rjzzleep 4514 days ago

to be fair, the main selling point of mongodb is that developers can access it more easily. i haven't really touched mongodb in over a year and then only for playing, but have you tried the elasticsearch filter query syntax? have you compared mongodbs syntax?

also, i have the exact opposite nitpick. people want to use it to do everything, mail indexers, file system indexers. what's the matter with web developer folks? why is it that when the next database comes around they want to use it for everything?

bilbo0s 4514 days ago

"....why is it that when the next database comes around they want to use it for everything?...."

Because they like a simple web stack. KISS means a faster time to market. Faster time to iterate. Faster time to fix bugs because there are fewer places those bugs can be. All of that doesn't even factor in the productivity benefits gained by not having to switch technologies from project to project.

But to be fair, ES is not some brand new database... ES has been around for a LONG time.

rpedela 4514 days ago

Apache Lucene has been around awhile. ES has been around since 2010.

bilbo0s 4513 days ago

Yeah...

that's a pretty long time.

AznHisoka 4515 days ago

Just curious, if I'm using say version 0.92, how would I go about backing up my ElasticSearch instance. Besides creating a replica in a server, then "freezing" it by disconnecting the server?

The pre-Snapshot/restore method is:

- Pause indexing

- Issue a flush request

- Rsync data directories somewhere

- Resume indexing

This is technically a very naive approach, since a simple rsync of the data dirs will include replicas too. If you were more diligent you could check the state files in each shard directory and only copy out the primaries.
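The four steps above could be scripted roughly as follows. This is a minimal sketch, not a vetted backup tool: `pause_indexing` and `resume_indexing` are hypothetical hooks into your own application, the URL and paths are placeholders, and the `send`/`run` parameters exist only so the sequence can be dry-run without a cluster. Like the naive approach it mirrors, it copies replicas too.

```python
import subprocess
import urllib.request


def flush_request(es_url):
    """Build the POST /_flush request that asks ES to flush to disk."""
    return urllib.request.Request(es_url.rstrip("/") + "/_flush", method="POST")


def rsync_command(data_dir, dest):
    """Build the rsync argv; the trailing slash copies the directory contents."""
    return ["rsync", "-a", data_dir.rstrip("/") + "/", dest]


def backup(es_url, data_dir, dest, pause_indexing, resume_indexing,
           send=urllib.request.urlopen, run=subprocess.check_call):
    """Pause indexing, flush, rsync the data dirs, then resume indexing."""
    pause_indexing()              # hypothetical application hook
    try:
        send(flush_request(es_url))
        run(rsync_command(data_dir, dest))
    finally:
        resume_indexing()         # always resume, even if a step failed
```

The `try`/`finally` is the one design choice worth keeping: if the flush or rsync fails, indexing should not stay paused.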

bilbo0s 4515 days ago

Polyfractal is right.

You can just google "elasticsearch rsync" to get information, and even scripts, that will do this for you. The thing is... you REALLY need to know what you're doing when you go this route.

Also, you can try the gateway feature. Gateway is actually pretty straightforward. Restore WILL be slow though. And for many scenarios ... it is not ideal. (You don't want to take a day, or even a few, to restore after a failure.)

I think the best advice is...

Update to 1.0.

Just go to 1.0 and do snapshots... you will save yourself A LOT of headaches.
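For anyone wondering what "do snapshots" looks like, here is a rough sketch of the three requests involved, as I recall them from the 1.0 snapshot/restore docs; check the reference for your version before relying on it. The repository name, snapshot name and location are made-up placeholders, and the helper only builds the requests rather than sending them.

```python
import json


def snapshot_requests(repo, snapshot, location):
    """Return (method, path, body) tuples: register repo, snapshot, restore."""
    # A shared-filesystem ("fs") repository; the location is a placeholder
    # and would need to be reachable from every node.
    register_body = json.dumps({"type": "fs", "settings": {"location": location}})
    return [
        ("PUT", "/_snapshot/%s" % repo, register_body),
        ("PUT", "/_snapshot/%s/%s?wait_for_completion=true" % (repo, snapshot), None),
        ("POST", "/_snapshot/%s/%s/_restore" % (repo, snapshot), None),
    ]
```

Calling `snapshot_requests("my_backup", "snapshot_1", "/mnt/backups")` gives the sequence to send with whatever HTTP client you already use.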

kainosnoema 4515 days ago

I'm surprised so many people miss this. Out of the box, Elasticsearch is a distributed NoSQL store with better write consistency (and arguably performance) than MongoDB offers in its default configuration. The major missing feature was backup snapshots and restores, which 1.0 delivers, along with aggregations that more than rival MongoDB's. The team has intentionally avoided marketing themselves as a NoSQL store (was told this directly by an employee), but they're aware of the potential and have customers using it as such.

nkoren 4515 days ago

It's easy to miss. On the front page, the word "store" only occurs once, buried three page-scrolls down in the body text. Otherwise it very much gives the impression of being some kind of analytics dashboard for third-party datastores. And I didn't notice that until after I've visited the website, clicked through a few links trying to figure out what the fuss was about, then gave up and decided to read the comments here.

Argorak 4515 days ago

Probably because some store features have been missing up to 1.0, like backup/restore without knowing database internals. (yes, rsync did the job, but only because you knew the list of guarantees that makes it possible).

Also, Lucene at its core is an Index. Changing the query strategy might require reindexing. It is perfectly valid to throw data at it, build the index and throw away the source. You will just never get it back again.

While ES can be used and tuned as a store just fine, it is not necessarily its raison d'etre.

gibrown 4515 days ago

While I agree with the sentiment, I think Shay (lead ES developer) has explicitly said that he does not consider ES to be a data store... yet. I think this is mostly due to maturity.

I help run a large ES cluster (with canonical data in MySQL), and I consider this cautious attitude by the ES developers to be a good thing.

spooneybarger 4515 days ago

He has indeed said that. We hosted the Elasticsearch meetup in NYC a couple of weeks ago and he specifically said it.

camus2 4515 days ago

did not know all that stuff, could Elasticsearch be the holy grail of document stores ?

room271 4515 days ago

No. The choice of datastore is still incredibly complicated in the distributed world; it's all about tradeoffs really.

For example, Elasticsearch has poor availability characteristics - both because it is master-slave and because it focuses on ensuring consistency - relative to, for example, something like Riak.

kainosnoema 4512 days ago

I don't believe it's "master-slave" in the way you're thinking. Elasticsearch shards its indexes among all available nodes, storing replicas of each shard on separate nodes when possible. This ensures that the entire cluster is available as long as at least one replica of a shard is still online. In fact, if configured properly, it has better availability than consistency since by default it only flushes its oplog to the Lucene index segments every second (though writes aren't considered committed until they reach a quorum of nodes, so consistency is fairly good in practice as well).

tracker1 4515 days ago

It is definitely a nice, and flexible option.. it truly depends on what your needs are... If you're often updating parts of a document, MongoDB or RethinkDB may be better options. If you want integration where a lot of parts are SQL with some document ability, PostgreSQL + V8 is pretty compelling. Also, something like Cassandra may suit your needs better if you want a better and more predictable growth curve.

There's no holy grail of data storage... ElasticSearch is really nice, and if it fits your needs, more power to you.

rpedela 4513 days ago

Well, maybe some day, but it is still too easy to corrupt the data or index. Recently I had a problem where the data itself was fine and searches worked correctly but it was 100x slower than it should be. It just started happening for no apparent reason and I just do basic searches on typical data. I still don't know what happened but creating a new index fixed the problem.

sandGorgon 4515 days ago

I had a live production logistics system running on top of Elasticsearch 0.6 (as a NoSQL database ) back in 2012. This powered one of India's largest ecommerce systems (at that time).

Elasticsearch is brilliant as a NoSQL store - and if you were already using Elasticsearch as a search system, you don't need to introduce yet another component into your stack.

axefrog 4515 days ago

What limitations should one be aware of that would make ElasticSearch not a viable candidate where something like MongoDB would be a better fit?

When running a search, ES by default will not show items that have been indexed in the last 1 second. Directly getting an item by its ID doesn't have that limit though, and you can optionally force a refresh so that a search shows all items.

Other than that (which is just performance tuning, really), ES matches mongodb feature for feature, and obviously has a lot of extra power from its search heritage such as facets and percolate.

So I can't actually think of any limitations, and it's why I said ES makes a better MongoDB than MongoDB.
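The one-second search delay versus immediate get-by-ID is easier to see in a toy model. To be clear, this is not Elasticsearch code and not how ES is implemented; `ToyIndex` and all its methods are invented here purely to illustrate the visibility rule described above.

```python
class ToyIndex:
    """Toy model of near-real-time search; not Elasticsearch code."""

    def __init__(self):
        self._docs = {}        # every indexed document, keyed by ID
        self._searchable = {}  # the snapshot that search() is allowed to see

    def index(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # Lookup by ID sees the latest write immediately.
        return self._docs.get(doc_id)

    def refresh(self):
        # Stands in for the periodic refresh (about 1 s by default in ES).
        self._searchable = dict(self._docs)

    def search(self, predicate):
        # Only documents present at the last refresh are visible.
        return [d for d in self._searchable.values() if predicate(d)]
```

Index a document, and `get` returns it straight away while `search` stays empty until `refresh()` is called.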

alisson 4515 days ago

On ElasticSearch you have to update the whole document, no commands to manipulate them. You don't have commands like: $set, $addToSet, $pop, etc..

You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data. I have difficulties matching documents with the exact title being searched for. On MongoDB that just works, on ElasticSearch you need to configure it.

ElasticSearch has some advantages and MongoDB others. I think they are great together. One for storage and the other for searching.

http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

Regarding updates, you can use the Update API for partial updates, and include a script to do things like "counter += 1" or "add value to existing array".

Internally it is still reindexing the entire document, but from your application's perspective, the Update API is a lot friendlier.
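A rough sketch of what those two kinds of Update API request look like, based on my memory of the 1.x docs (the script syntax changed in later versions, so treat the body shape as an assumption to verify). The index, type and field names are placeholders, and the helpers only build the request.

```python
import json


def scripted_update(index, doc_type, doc_id, script, params=None):
    """Build a scripted update, e.g. incrementing a counter in place."""
    path = "/%s/%s/%s/_update" % (index, doc_type, doc_id)
    body = {"script": script}
    if params:
        body["params"] = params
    return "POST", path, json.dumps(body)


def partial_doc_update(index, doc_type, doc_id, fields):
    """Build a partial-document update that merges `fields` into the doc."""
    path = "/%s/%s/%s/_update" % (index, doc_type, doc_id)
    return "POST", path, json.dumps({"doc": fields})
```

For example, `scripted_update("blog", "post", "1", "ctx._source.counter += count", {"count": 1})` is the "counter += 1" case; it covers much of what `$set`-style operators are used for, even though ES reindexes the whole document underneath.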

alisson 4515 days ago

Thanks for pointing that out, it will be really useful!

xtracto 4515 days ago

>You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data.

This is really important. Creating a proper searching experience with auto-complete which works "just like you want" can be a very painful experience with ES, especially if you are new to ES. It bit me some time ago when I was trying to achieve just that.

hkon 4514 days ago

Care to elaborate? What were the steps you had to go through?

scorpion032 4512 days ago

If for storage of data, I'd use and only use a RDBMS like Postgres. Not Mongo.

I can't comment much on MongoDB, but I've written a bit about things to keep in mind when considering Elasticsearch as a NoSQL store here: https://www.found.no/foundation/elasticsearch-as-nosql/

curun1r 4515 days ago

An interesting read, but I'd disagree with your contention that NoSQL isn't about ACID. When NoSQL databases started coming out, it was really about which CAP guarantee a database chooses to compromise. Traditional SQL databases are either partition-intolerant or become unavailable (for writes) in the event of a partition. NoSQL databases compromise on consistency. If a database is claiming to be NoSQL and have ACID transactions, they've either disproven CAP or aren't part of the new group of distributed, partition-tolerant databases that people have been calling NoSQL. It's been said for a while that NoSQL is a terrible name for that group of technologies and now that we're getting databases with a non-SQL interface but also having consistency guarantees, the name is starting to cause even more confusion.

Side note: Happy Found customer here...you guys have made it much easier to run our ES index!

Thanks for the feedback!

The point of that section is exactly that "NoSQL" (or, to make things even more confusing, "NOSQL" for "Not only SQL") doesn't have a very specific meaning. Some think it rules out ACID, others don't. Thus, you'll need to know what you need.

And database marketing tends not to be very good at pointing out what a product is not good at, or at actually delivering what it promises. See also: http://aphyr.com/tags/jepsen

room271 4515 days ago

I'm not sure you have this right. CAP says nothing about ACID - it only mentions consistency.

NoSQL was in large part about precisely what the name implies - giving up relational (SQL) data in exchange for better performance and the ability to have a distributed store. Yes, part of this is also about being willing to trade off consistency for availability. But Elasticsearch is an example of a NoSQL store which does focus on consistency (in this case at the expense of availability and, to some extent, partition tolerance).

http://docs.mongodb.org/manual/applications/geospatial-index...

sjs382 4515 days ago

I'm not sure if ElasticSearch does anything like this, but I make use of MongoDB's GeoJSON queries, namely the $geoIntersects operator.

sjs382 4515 days ago

Wow, it looks like they do... http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

In addition to the various geo filters/queries, there are also two aggregations for geo related stuff:

Geohash Grid: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

Geodistance: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

morganherlocker 4515 days ago

Might not matter, but they do not follow the geojson spec for spatial storage.

http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

Argorak 4514 days ago

Sure, ES supports lat/lon as properties, strings, geohash and geojson:

abhirama 4514 days ago

When I played around with it, I could not figure out a way to get the exact count of events in the datastore when the data was distributed in replicas. In fact, there was a ticket open for this, but I'm not able to fish it out now.

presharding

You create a number of shards for each index(database) that you can't later expand.

Is this still a limitation? I haven't run into any use cases where this has been a problem yet. Since the default shards are 10 and 2 replicas, does that not mean each index should be able to scale up to 20 servers? I'd think that if your data grew enough that 1/10th does not fit on a server, you could do a one time maintenance and rebuild all your servers.

I have my doubts mongodb would scale up that well to 20+ servers without some maintenance as well. So I'm not sure how that's really a limitation anyone should use for choosing mongodb or ES. If you're expecting that kind of data, just make a large number of shards in your index creation as it will work fine on fewer servers too?

you can grow a little larger than that by using some nodes only for aggregating/handling queries(holding no data/shards)

larger number of shards=slower searching (unless you distribute the shards to multiple nodes)

AznHisoka 4515 days ago

What I've done, and I'm not totally sure if it's a best practice is I've over-allocated the # of shards. So if I think I need 5 shards, I create 50 or 100 shards instead. Then I'll have some app logic to determine the shard a document should go to. Initially all docs will go to shard 0. Then when that's full (around 15 GB of size, depends on your RAM), then I set all docs to go to shard 1. Of course, you'll need to be careful as you dun want duplicate documents in different shards.

The benefit of this is that as your app scales, you'll search only the shards needed. So if you have just 1 shard w/ data, you can tell ElasticSearch to just search in that 1 shard.
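A small sketch of why routing lets you target one shard, and why the shard count is fixed up front. As far as I know, ES picks a shard by hashing a routing value (the document ID by default) modulo the number of primary shards; the CRC32 hash below is a stand-in chosen for illustration, not the hash ES actually uses, and both function names are made up.

```python
import zlib


def shard_for(routing, num_shards):
    """Map a routing value to a shard number (stand-in hash, not ES's own)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def moved_keys(keys, old_shards, new_shards):
    """Keys that would land on a different shard if the count changed."""
    return [k for k in keys
            if shard_for(k, old_shards) != shard_for(k, new_shards)]
```

The same routing value always maps to the same shard, so a search that passes that value only needs to visit one shard. `moved_keys` shows the flip side: change the shard count and many existing documents would be looked up in the wrong place, which is why people over-allocate at index creation.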

aquadrop 4515 days ago

So, what happens when you fill up the last shard?

look: routing_field

also changing indexed-fields on the go

mtrn 4515 days ago

True. I evaluated Mongo, Couch and a couple of similar solutions, but ES being a search engine from the start really convinced me, that it can be a viable database for loosely structured data.

g9yuayon 4515 days ago

I don't know much about MongoDB, but it's true that Elasticsearch is a great NoSQL db with support of boolean search. Netflix has a number of use cases that use Elasticsearch as such NoSQL db: http://www.slideshare.net/g9yuayon/elasticsearch-in-netflix

ErrantX 4514 days ago

Definitely! We are using it in production for storing monitoring data (via sensu, if anyone is interested). It's fantastic because you can shove data into the index with a ttl of 1 year. And have a x month archival strategy for cold storage.

Its search capabilities and scalability are fantastic - we're throwing GB of data into it weekly and it just soaks it up.

tracker1 4515 days ago

I would suggest that everyone who is considering one, look at both... When I looked into both, about a year and a half ago, I found that geospatial searches worked better in MongoDB at the time, and shaping my data to fit was more awkward with ElasticSearch.

That said, it's definitely worth looking into both, depending on what your needs are.

obastemur 4514 days ago

"most people don't realize is that it makes a better MongoDB than MongoDB "

(IMHO) Unfortunately, for most people old habits die hard. Indeed a nice project and a great release.

m0th87 4515 days ago

It was two weeks ago, and our startup was on the precipice of a major launch. We had completely rewritten our online publication site, which drives the bulk of our traffic. The product had to be shipped on-time - we had press releases, eager investors and a launch party dependent on it.

A few days before launch, things were not looking good. As admins manipulated articles in preparation for the launch, the servers kept crashing.

In a time-constrained major launch like this, a lot of nasty little hacks build up in the codebase. Our search system for admins was a complete mess. It was a custom solution that worked fine when admins managed a handful of database records, but now that they were managing thousands of articles, it was not scaling at all.

At the 11th hour, we dropped elasticsearch into our infrastructure. It worked like a charm. The servers stopped crapping out, and we launched on time.

Elasticsearch mostly "just works", and we didn't have to worry about complex schema definitions, working with giant complex XML files (hello Solr), or build anything on top to interface between the index and the queries themselves (Lucene). Thanks elasticsearch, you saved us!

[0] https://cwiki.apache.org/confluence/display/solr/SolrCloud

dc2447 4515 days ago

> Elasticsearch mostly "just works", and we didn't have to worry about complex schema definitions, working with giant complex XML files (hello Solr)

If you were using Solr there are a few operational modes to run in: config file based or SolrCloud[0]. The latter is more akin to ES in terms of cluster management.

I agree though that, from a simplicity-of-deployment perspective at scale, ES has a much lighter learning curve.

acdha 4515 days ago

SolrCloud is nothing like ES in terms of management: you end up running a separate zookeeper service with even more files which all have to be configured correctly just to get it running and you have to micromanage shard allocation to ensure that you can add nodes in the future but also not have it intentionally deadlock when a server fails and you no longer have enough nodes for a quorum. All of this happens with the usual contempt for sysadmins where things you need to know (“refusing to process requests”) won't be logged but a bunch of startup boilerplate will be, and simply configuring logging correctly requires (IIRC) editing two XML files and a properties file.

`java -jar elasticsearch.jar` does a better job and that's basically all it takes. I'm planning to switch as soon as https://github.com/elasticsearch/elasticsearch/issues/256 lands.

darkarmani 4513 days ago

I lost count of the +1s. That issue must have around +180. :)

troels 4515 days ago

Did you try/consider Sphinx? It's simple and it's quite fast. I'm using that and I'm pretty happy with it, but I might investigate ES at some point to see if I can squeeze a bit more speed out of it.

rch 4515 days ago

You might also take a look at the search functionality in Riak. I've run both Solr and ES, the latter at significant scale, and I'm leaning more towards Riak going forward. The difference is mainly convenience, so not a reason to switch off something that's working already.

troels 4515 days ago

Hadn't considered Riak, but I can see that it has some full-text search capabilities. Any idea about its features and how it compares in performance, as a raw search index?

[1]: http://www.christopherbiscardi.com/2014/02/07/geospatial-ind...

biscarch 4515 days ago

Riak 2.x uses Solr to index values from K/V with AAE. If you're interested in how using it looks, I wrote a post using geospatial data here[1].

rch 4515 days ago

If it's just Solr underneath, then why is the pseudo-Solr API implementation not a complete implementation? Something to do with each node being an isolated Solr instance maybe?

rch 4515 days ago

I don't know of any publicly available relative raw performance benchmarks, and haven't done any myself. My guess is that the compelling features would be more in the realm of node operations and recovery from node failures.

Edit: Apparently my Riak knowledge is dated now anyway. It looks like I have some research to do myself, but it's pretty exciting stuff.

m0th87 4515 days ago

As far as I can tell, Sphinx has a more involved setup process. Also our search runs against JSON documents, which seems to suit Elasticsearch better than Sphinx. I might be wrong on both counts though, we really didn't look into Sphinx enough to give it a fair appraisal.

nasalgoat 4515 days ago

Sphinx is a bit too 1:1 - it only works as a single server, not a cluster.

troels 4515 days ago

Well, you could simply have multiple instances running on different nodes. It's manual work, but by no means impossible. In my setup, I have a sphinx server running on the same node as my web server (Which is the consumer of the search). So they scale with each other. For more advanced uses, it's probably not adequate, but it's not a big concern of mine.

[1]: https://groups.google.com/forum/#!topic/elasticsearch/iTy9IY...

mavelikara 4515 days ago

ES seems to have the ability to run analytic queries. I have read about people using it as an OLAP solution [1], although I have not yet read anyone describe their experience. In that respect, how do ES's analytics capabilities compare against:

1) Dremel clones [2] like Impala & Presto (for near real-time, ad hoc analytic queries over large datasets)

2) Lambda Architecture [3] systems (where queries are known up-front, but need to run against a large dataset)

Does anyone here have experience with ES in such use cases, beyond the free text searching that ES is well known for?

[2]: http://static.googleusercontent.com/media/research.google.co...

[3]: http://jameskinley.tumblr.com/post/37398560534/the-lambda-ar...

zcrar70 4515 days ago

I would also be interested in this.

Argorak 4515 days ago

Beyond the technology, Elasticsearch has a very mature, active and helpful community with users groups all over the world. We're well connected.

Pick your favourite users group here: http://elasticsearch.meetup.com/

Full disclosure: I started and run the Berlin UG. We set ourselves apart by always providing a small introduction into ES for those that are completely new and would have a hard time following the main talk.

shurane 4514 days ago

Intros to ES and other technologies are useful.

I don't see many tutorials covering usage of ES here: http://www.elasticsearch.org/tutorials/

Could you maybe provide a link to yours?

Argorak 4513 days ago

The introduction is in person, at the users group.

Yep, tutorials is a huge problem, but there are people working on that.

bryanh 4515 days ago

The thing that worried me the most about Elasticsearch was how fragile it got around the limits of its performance. Run out of memory because of a nasty query? Boom, data corrupted. I hope you weren't using it as your primary persistence layer...

Otherwise, we love ES. The other comment about it being a better Mongo than Mongo rings true. With the backup/restore API and some of the circuit breakers, I'm hopeful that my fears will be abated.

[1] http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

FWIW, this is a place ES devs are spending a lot of time thinking about. For example, 1.0 introduces a new "Circuit Breaker" [1] feature which will help prevent over-eager facets from blowing out the heap. It's just one part of a very large effort to make ES handle exceptional events more gracefully (in particular, memory related).

Another example are disk-based doc values [2], which are essentially pre-computed field data structures that are stored on disk. This moves Field Data off heap and allows the OS to manage memory evictions, to help minimize GCs and OOM blowouts.

[2] http://www.elasticsearch.org/blog/disk-based-field-data-a-k-...

Ditto open file handles, which is easy to push when aggressively over-sharding. Not an uncommon mistake for the enthusiastic newbie.

Having supported Solr/ES/Lucene in production for 4+ years now (websolr.com / bonsai.io) I would be pretty hesitant to trust Lucene in general as a primary data store. Beautiful for secondary indexing, but otherwise, Why Not Postgres?™ ;)

Complexity. Having two copies of the data means more dev time, more resources required to shift the data around, etc. Having just 1 data store that can also handle all your searching is like the holy grail. As you say, not sure if Solr/ES/Lucene are there yet - but they're definitely very very close. There is no theoretical barrier either - it just comes down to closing bugs, and the ES/Lucene team are very good at closing bugs.

EDIT: I don't think MongoDB is there yet either. There are definite benefits and drawbacks between Postgres and ES, tipping heavily towards Postgres for structured heavy write data. But for ES and MongoDB? I think MongoDB falls a bit short there.

Sometimes, I actually find it easier to have more systems that do their job really well and sync things between them, rather than trying to get a single system to do everything.

For example, Postgres lets you reason about integrity, atomicity and transactional boundaries, and whether things are really safely stored with synchronous replication. If Postgres returns after a commit, I trust it. However, that requires me to have two servers working, which is harder to keep highly available.

ZooKeeper, on the other hand, I can rely on being available. But that's not really something you want to be putting lots of load on, nor try to do anything but trivial "queries". And the more servers you add, the slower writes get.

I don't trust Elasticsearch enough for those tasks, yet I wouldn't want to do searches in Postgres (Yep, I'm familiar with tsearch) even though it can. Elasticsearch is simple to scale out and awesome for searching.

Logs and metrics we shove straight into Elasticsearch, however. Other things go from ZooKeeper to Postgres and then to Elasticsearch, or from just Postgres to Elasticsearch.

Separate tools for separate jobs. I'm one of the co-founders of www.found.no, one of the hosted Elasticsearch providers . We absolutely love Elasticsearch and find new use cases for it all the time, but it's not going to be the one store to rule them all, at least not very soon.

I'd like to point out that two competing founders of hosted Elasticsearch as a service agree: ES is great, but not a general-purpose data store :-)

Hi, Nick. :)

Indeed!

That said, it's great that more people are picking up Elasticsearch for new exciting things.

Elasticsearch has really pushed what constitutes a "search problem", and deserves lots of kudos for that! :)

Sure, that's a fair point. Data consistency reliability in ES and Lucene will only get better over time.

But I personally suspect Lucene won't ever get away from the dreaded "just reindex." And to the larger point, I think recent resurgent interest in data stores and distributed systems have shown pretty clearly that there is no holy grail. No single data store can provide all the semantics necessary for all use cases. Maybe not even for most use cases. There are just too many tradeoffs to consider.

Believe me, I earn a living hosting Elasticsearch, so I'd love to see it become a robust primary data store. There are some use cases where it actually does make sense—just look at the amazing traction ES is experiencing for storing and indexing time-series data.

But as a general-purpose primary store, I'm not really holding my breath. Maybe I'm just becoming battle-worn and bitter. I would love to be proven otherwise over the next few years!

I'd like to learn from you about "general-purpose primary store". Do you mean for storing any type of data? Here is what I think regarding the case you brought up in the previous post:

ES is suitable for full-text based document indexing for enterprise-level or other websites, which means they have a reasonable amount of data to be indexed in a given timeframe. A complete re-indexing won't take a couple of days.

So the basic idea behind the NoSQL database is to dump the data into the database quickly and return, so you can see very fast response for insert and delete. Then it will load the data into the memory to process for real-time retrieval which also produces fast response from select. I'm not sure about update.

If the data volume grows, they quickly add shards or make the number of pre-shards big enough to allocate enough memory resources to handle the queries or let the OS to swap the memories by adding more server nodes.

So if you want to use a NoSQL database, you must accept its system requirements, make your application fit them and take the most advantage of it. Otherwise, if you are running a highly structured data store, it is better to use a relational database.

Another point is: if the documents are collected from the web like search engine, NoSQL will not fit for the large volume of data and relational database is also used to store the indexed data for fast retrieval. I guess this is what you meant "general-purpose primary store".

Correct me if I'm wrong.

berkay 4515 days ago

I think the distinction made in this comment is valuable, and echoes our experience. ES is not (at least so far) suitable as a general purpose data store, but it is suitable (and very good) for more than search. For some use cases, it's the best available.

wikyd 4515 days ago

I think the CSS for bonsai.io is not loading.

Thanks. I screwed up our CDN settings while trying to push out some changes. Working on that :-)

sandstrom 4515 days ago

This gem is from the 'breaking changes' list:

  “Geo queries used to use miles as the default unit. And we 
  all know what happened at NASA because of that decision. The
  new default unit is meters.”

I like this release already.

roryokane 4515 days ago

Link to that page: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

lflux 4515 days ago

> Easy to read, console-based insight into what is happening in your cluster. Particularly useful to the sysadmin when the alarm goes off at 3am and JSON is too difficult to read.

It's these little details I love, when a project actually cares about operations and not just "well here's the API"

I've been using ElasticSearch only for Logstash, but i've been blown away so far as how easy it is to deal with.

axionike 4515 days ago

ES has performed very well for us as the backbone for the solution we deployed for a large government-sector customer. Had some GC issues initially, and were worried about user concurrency, especially since we were not restricting queries (i.e. users can do full-scale wildcard searches against the entire data set of 1BN+ records). But ES continues to shine.

Congrats to the ElasticSearch team, and all the supporters around it. Once I get back into more of a coding role, I'll definitely be contributing back to the ES project.

room271 4515 days ago

This may require a bit more lengthy answer than makes sense here, but I'm curious about what was causing your GC issues and how you fixed them (we have GC issues at the moment).

Not the OP, but GC issues in Elasticsearch basically boil down to memory pressure (obviously), which is usually caused by facets. Facets eat a lot of memory, especially if you are faceting high-cardinality fields - think fields like "tags" or any analyzed field. High cardinality, analyzed strings is the easiest way to blow out the heap.

There are other reasons, but that is like 90% of GC issues. To solve it, you need to make sure your faceted fields are configured well (usually not_analyzed) and assess how much memory is available. You may be able to index and even full-text search ten billion docs on a single machine, but faceting it may just be too much to ask for a single node.
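To make the "usually not_analyzed" advice concrete, here is a minimal sketch of a mapping that keeps a faceted field as a single exact term per value. It assumes ES 1.x mapping syntax; the type name `doc` and the field names `tags` and `body` are hypothetical, not from the comment above.

```python
import json

# Hypothetical type ("doc") and field names ("tags", "body"); adjust to your data.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                # Faceted field: indexed as one exact term per value, not tokenized.
                "tags": {"type": "string", "index": "not_analyzed"},
                # Full-text field: analyzed for search, but not meant for faceting.
                "body": {"type": "string"},
            }
        }
    }
}


def mapping_json(m):
    """Serialize the mapping as the JSON body of an index-creation request."""
    return json.dumps(m, indent=2, sort_keys=True)


print(mapping_json(mapping))
```

The printed JSON would be sent as the body when creating the index; the sketch only builds the body and does not talk to a cluster.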

Omitting norms, disabling bloom filters on old indices and enabling doc values are other ways to help alleviate field-data pressure.

Other GC culprits can be: too large bulk requests, unbounded threadpool queues, or something like parent/child/scripts/filter cache keys eating all your memory. Also don't go above 30gb heaps, the JVM becomes unhappy :)

[1]: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

I also took a few days a few weeks ago to set up Elasticsearch after my MySQL full-text search fell apart.

What I'm doing is slamming the full text output of OCRed PDFs into a MyISAM table, the entire document in a text field.

What I'm afraid I'm not doing right is creating the web interface to search elasticsearch. I'm using filters with the query string syntax[1] in the search box, pointing directly at that fulltext column. I'm also using the highlight functionality so that I can specify how many highlight blurbs to return with the result. The query string syntax works great with the OCR'd text, because most of it is near-garbage (as most OCR is), so you can search for something like "net sales"~50 to find those two terms within 50 words of each other. I think the results were something like:

  net sales: 15,000 results
  "net sales": 120 results
  "net sales"~50: 550 results
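A setup like the one described might send a request body roughly like the following. This is only a sketch under assumptions: it uses a plain `query_string` query rather than a filter, the field name `fulltext` is taken from the description, and the helper name `build_search_body` and the fragment count are made up for illustration.

```python
import json


def build_search_body(query_text, field="fulltext", fragments=3):
    """Build a query_string search with highlighting on a single field."""
    return {
        "query": {
            "query_string": {
                # Field searched when the user does not name one in the query.
                "default_field": field,
                # Raw user input, e.g. a phrase with a proximity (slop) suffix.
                "query": query_text,
            }
        },
        "highlight": {
            "fields": {
                # How many highlighted snippets to return per hit for this field.
                field: {"number_of_fragments": fragments}
            }
        },
    }


body = build_search_body('"net sales"~50', fragments=5)
print(json.dumps(body, indent=2))
```

Passing the search box text straight through keeps the full query string syntax available to users, at the cost of syntax errors on malformed input.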

Can anyone point me at a good web based search implementation using elasticsearch that explains how they're doing it?

What I have works pretty good, I just want to... check my work, I guess.

I host and support websolr.com and bonsai.io and have seen a lot of search implementations.

The main thing for good stability and performance is to be very good at batching your updates. You don't want to sling a ton of highly-parallel single-document updates at Lucene, lest you thrash the JVM and start garbage collecting like crazy.
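In Elasticsearch terms, batching usually means the bulk API, which takes newline-delimited JSON: one action line followed by one source line per document. The sketch below only builds those request bodies; the index name, type name, batch size and sample documents are hypothetical, and the right batch size depends on document size and cluster capacity.

```python
import json


def chunks(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def bulk_body(batch, index, doc_type):
    """Build a newline-delimited bulk body for a batch of (id, source) pairs."""
    lines = []
    for doc_id, source in batch:
        # Action line: says which index/type/id the next line belongs to.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical documents; in practice these come from your own pipeline.
docs = [(str(i), {"title": "doc %d" % i}) for i in range(5)]
bodies = [bulk_body(batch, "docs", "doc") for batch in chunks(docs, 2)]
print(len(bodies))
```

Each body would be sent as one request, so five documents in batches of two become three requests instead of five.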

From there, on the query side, you'll want to get a good working knowledge of the different tokenization and analysis options. There are a lot of subtle and interesting combinations to be had in there that influence performance and relevance of your search results.

Do you have a demo on either of those sites where I can input terms into a search box and look at results? What explanation do you give to users as to the options available when formatting the query?

1. http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

We've got a free Heroku addon that's pretty easy to spin up and play with. Elasticsearch also has an analyze[1] API that can be helpful to play around with.

It's also possible to download and install ES locally and run any number of front-end interfaces, some of which include query builders. ElasticHQ seems like a decent option for that. The venerable Elasticsearch-head is another.

I think now that ES 1.0 has shipped, more experimental tools will start to emerge that help people learn and interact with ES itself. (If anyone out there is a front-end whiz and wants to help me build something like that, please email nz@bonsai.io!)

May I ask what you meant about "web based search implementation using elasticsearch"?

Do you mean that you use ES to do indexing on the backend of your documents and make it available on the web? Or do you mean that you use ES to index documents available on the web and let people search for them?

Sure. Your first guess is correct - I do indexing of backend documents.

I fetch a steady stream of FOIA documents, close to the maximum possible each week, and PDF/OCR them. I expose a web interface to the analysts I work with, to help them gather up documents for further analysis.

The second guess would probably be more interesting to most people.

Yes, then I think ES fits your application well, and you should really take advantage of it to provide your web interface for searching those documents.

I'm more interested in the second case, but I don't think ES fits due to the huge volume of data to be indexed.

Oh - I have one! I just want to see examples of others so I can figure out ways to improve my implementation.

xutopia 4515 days ago

I love when something I've been using in production for what seems like years just announces now that they've reached 1.0.

brickcap 4514 days ago

Well does it not make you feel glad that you took the risk? After all version is just a number :)

dabeeeenster 4515 days ago

ES is a fantastic project. Thank you thank you thank you for your contribution; truly standing on the shoulders...

jonhmchan 4515 days ago

Congrats to the team - absolutely love elasticsearch. Having a lot of fun with it here at Stack Overflow.

pron 4515 days ago

What does Elasticsearch add on top of Lucene?

lobster_johnson 4514 days ago

A lot. Lucene is basically the inverted indexes, providing on-disk structures and a mechanism to query, as well as assorted bits like tokenization.

ES adds distribution (multimaster-replicated cluster of nodes connected via a gossip protocol), sharding, defines a document model and schema (the mapping of arbitrary JSON documents to index structures), faceting, aggregation (ie., roll-up-type calculations), various types of scoring (eg., geographic distance), ETL ("rivers"), backup/restore, performance metrics, a plugin system (eg., for indexing different file formats) and a bunch of other things -- and of course a REST-based API on top of the whole thing.

https://github.com/elasticsearch/elasticsearch

buckbova 4515 days ago

I didn't know what this was and looking at this link it was tough to tell.

The github lays it out well.

alecco 4515 days ago

Why is it awesome? Why "it just works"? Is it just a mongodb-kind document store over Hadoop+Lucene?

What makes it so special to have hundreds of votes and tweets all around within 2 hours?

I don't understand. A DB engine engineer.

gibrown 4515 days ago

There are a lot of features thoughtfully combined that make ES great. Top of my list would be:

1. It handles human-written language. Any language. The same technology that lets it handle strings written in human language provides a lot of flexibility in handling strings in other applications. Particularly when handling logs.

2. Non-string data it also handles very fast and cleanly (numbers, dates, geo).

3. Lucene has an inverted index that has been optimized over many years. ES scales that pretty seamlessly across many servers. All decisions in the project seem to be made around whether a feature can scale to 100s of nodes.

The devs have also been really smart to focus on the "out of box experience". Very well thought out defaults.

More on our experience with ES at scale: http://gibrown.wordpress.com/2014/01/09/scaling-elasticsearc...

https://lucene.apache.org/core/

buckbova 4515 days ago

Is this accurate for Elasticsearch, since it is built on Lucene?

"index size roughly 20-30% the size of text indexed"

That seems excessive for an index.

gibrown 4515 days ago

Not sure how that's calculated. I assume it is accurate, but the index size is going to depend a lot on what kind of text you have and how it is separated into individual terms (or n-grams or all the other ways you can tokenize and filter to create individual terms).

Personally, I think of disk space as cheap, and am far more concerned with having options to improve speed and quality of search results.

Distributed, full-text search (many, many options), highlighter, compressed, geo queries, searching across multiple indexes (databases) and types (tables), distributed aggregation, distributed faceting, a very fast in-memory suggester, and inverse queries (the percolator), where you register queries (like rows) and then test documents to see if they match those queries.

And many other things.
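The inverse-query idea can be illustrated with a toy in plain Python. This is not the percolator API and says nothing about how ES implements it; the class name, the whitespace tokenizer and the "all terms must appear" matching rule are simplifications invented for this sketch.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


class ToyPercolator:
    """Stores queries up front, then reports which ones a document matches."""

    def __init__(self):
        self.queries = {}

    def register(self, query_id, required_terms):
        # A "query" here is just a set of terms that must all be present.
        self.queries[query_id] = set(term.lower() for term in required_terms)

    def percolate(self, document_text):
        terms = tokenize(document_text)
        # A query matches when every one of its terms occurs in the document.
        return sorted(
            query_id
            for query_id, required in self.queries.items()
            if required <= terms
        )


percolator = ToyPercolator()
percolator.register("es-release", ["elasticsearch", "released"])
percolator.register("solr-news", ["solr"])
print(percolator.percolate("Elasticsearch 1.0.0 released today"))
```

The roles are swapped compared to normal search: queries are the stored data, and the document is what you search with.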

philfreo 4515 days ago

We wrote a tutorial about how we wrote our search for Close.io using elasticsearch and pyparsing:

"Sales data search: Writing a query parser / AST using pyparsing + elasticsearch"

Part 1: http://blog.close.io/sales-data-search-writing-a-query-parse...

Part 2: http://blog.close.io/sales-data-search-writing-a-query-parse...

karterk 4515 days ago

Elasticsearch mostly "just works". The latest version of Solr has made clustering easier (requires managing Zookeeper), but before that, it was either ES or a nightmare.

Lucene is one of those projects which hardly has any real competition. That's surprising given how many real world software projects have a search requirement. While Lucene is excellent, it's not without flaws and competition is always great.

m0th87 4515 days ago

FWIW, Elasticsearch builds on Lucene. It's just working at a much higher level of abstraction.

I agree with you, almost every website needs a search server on the backend for people to search their document base, especially for enterprise intranets. Maybe enterprises are using commercial products, such as SharePoint. How about the rest of the small businesses and websites? Maybe the learning curve has been too steep for every website to adopt it so far.

swah 4515 days ago

Hmm, could that be because they have to compete with free?

malaporte 4513 days ago

Lucene does have competition, mostly in the commercial world. I know, since I work for one of those companies :p

Solr, ElasticSearch, etc. are mostly concerned about the index/search features, and they do quite a good job there. But this still leaves a huge amount of space for commercial offerings, as core search is only a part of the problem. I'm thinking about connectivity with complex enterprise systems, support for the specific security models of those systems, integration in other systems, etc. Believe me, those problems are not easy to solve.

So, even if we have an index that can most probably match Lucene's feature for feature and quite a lot of things beside, we typically won't go after deals where simple search is the only requirement. Instead we focus on larger deals with more complex requirements. And we're doing quite well, thank you :)

Zilog 4515 days ago

Too bad they have yet to address the split brain issue.

chriscareycode 4515 days ago

I haven't had a split brain on my 15 node cluster in over 6 months even though the cluster is split among multiple data centers which do drop connectivity from time to time. When the setting was wrong, it happened constantly. Tune it properly and it won't happen. n/2+1
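The "n/2+1" at the end is the usual quorum rule. The comment does not name the setting; it is presumably `discovery.zen.minimum_master_nodes`, and the helper name below is made up. A minimal sketch of the arithmetic, assuming n counts master-eligible nodes:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest group of master-eligible nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division: n/2 rounded down, plus one.
    return master_eligible_nodes // 2 + 1


# For the 15-node cluster mentioned above (assuming all 15 are master-eligible).
print(minimum_master_nodes(15))  # 8
```

With a majority required, two halves of a partitioned cluster cannot both elect a master, because at most one side can hold more than half of the nodes.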

r00fus 4515 days ago

Link for the curious: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-br...

AznHisoka 4515 days ago

True, that's a valid issue. For me, it's not as I end up indexing the same document multiple times over the course of 2-3 days.

https://groups.google.com/d/msg/elasticsearch/Rb7Lei4gaaE/7I...

hungryblank 4514 days ago

At Contentful in Berlin (Germany) we're looking for an elasticsearch/lucene expert, if you're excited by this tool and want to work full time with it get in touch.

capkutay 4515 days ago

I was vetting ES for a business-critical search platform and had some concerns about write/read performance and how the Lucene indexes are handled on disk. I read that it doesn't really perform as well as Splunk... Instead of ES, I'm considering a solution using HBase to shard Lucene indexes on HDFS.

gane5h 4515 days ago

Really impressed with the pace of innovation in the last few months: cat api, aggregations, snapshots. The unfortunate side effect is that books and stack overflow posts written before 1.0 are outdated.

Disclaimer: I’m the founder of a hosted Search As A Service and we use ES in a few critical parts of our infrastructure.

mtrn 4515 days ago

Elasticsearch is a really great piece of software because it makes the simple easy and the complicated possible.

vhost- 4515 days ago

I'd be curious to see how well Elastic Search holds up to Endeca. I'm currently stuck maintaining some Endeca instances and it's a nightmare. I wish I could go back to ES.

At my last place of work, ES was beautiful and required little work to get a very fast, workable search in place.

quicksilver03 4514 days ago

FYI, at my shop we use Oracle Commerce (ATG) and we've seen Oracle's salespeople pushing Endeca to all current and new customers.

For our current project we went with ElasticSearch and we're quite happy. One of the contributing factors was that one of our most experienced guys was unable to get the damn thing (Endeca) installed, even with the help of one Endeca consultant.

pyotrgalois 4515 days ago

Great news. In every new project that we create (in general REST JSON APIs made with nodejs, erlang or rails that are consumed by iOS and android clients) we always end up using postgresql, redis and elasticsearch. Great tools.

kailuowang 4515 days ago

Congratulations to the team. This is a great library that we really appreciate.

willcodeforfoo 4515 days ago

Congrats! Elasticsearch is one of my favorite recent pieces of technology.

rartichoke 4515 days ago

ES is one of the few techs that I seriously love.

The rails support for it is amazing too. The guy creating the rails integration lib is really talented and active.

elchief 4514 days ago

Anybody know if elasticsearch does multiword synonyms properly? (Solr doesn't). Thx

skarnik 4515 days ago

congrats to the team!

dreamdu5t 4515 days ago

We recently switched from using MixPanel + Crittercism + Sphinx to using qbox.io (hosted elasticsearch) and Kibana to do all our analytics, crash reporting, and search.

I can't recommend qbox.io enough! Point-and-click scaling of managed elasticsearch clusters + Kibana == bliss.