
by NDizzle 4515 days ago

I also took a few days a few weeks ago to set up Elasticsearch after my MySQL full-text search fell apart.

What I'm doing is slamming the full text output of OCRed PDFs into a MyISAM table, the entire document in a text field.

What I'm afraid I'm not doing right is creating the web interface to search Elasticsearch. I'm using filters with the query string syntax[1] in the search box, pointing directly at that fulltext column. I'm also using the highlight functionality so that I can specify how many highlight blurbs to return with each result. The query string syntax works great with the OCR'd text, because most of it is near-garbage (as most OCR is), so you can search for something like "net sales"~50 to find those two terms within 50 words of each other. I think the results were something like:

net sales: 15,000 results
"net sales": 120 results
"net sales"~50: 550 results
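For anyone following along, here is a minimal sketch of the kind of request body described above: a `query_string` query aimed at one field, plus a `highlight` section that caps the number of blurbs. The field name `fulltext`, the helper name `build_search_body`, and the default fragment settings are assumptions made for illustration, not the poster's actual setup.

```python
import json


def build_search_body(user_query, field="fulltext", fragments=3, fragment_size=150):
    """Build a search body: query_string query plus highlighting.

    `field`, `fragments`, and `fragment_size` are illustrative defaults.
    """
    return {
        "query": {
            "query_string": {
                # Raw text from the search box, e.g. '"net sales"~50'
                "query": user_query,
                # Search this field when the user does not name one
                "default_field": field,
            }
        },
        "highlight": {
            "fields": {
                field: {
                    # How many highlight blurbs to return per hit
                    "number_of_fragments": fragments,
                    # Approximate size of each blurb, in characters
                    "fragment_size": fragment_size,
                }
            }
        },
    }


body = build_search_body('"net sales"~50')
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent to the index's `_search` endpoint. Passing the search box text straight through keeps the full query string syntax available to users, at the cost of syntax errors when they type unbalanced quotes or brackets.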

Can anyone point me at a good web based search implementation using elasticsearch that explains how they're doing it?

What I have works pretty well; I just want to... check my work, I guess.

[1]: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

2 comments

nzadrozny 4515 days ago

I host and support websolr.com and bonsai.io and have seen a lot of search implementations.

The main thing for good stability and performance is to be very good at batching your updates. You don't want to sling a ton of highly-parallel single-document updates at Lucene, lest you thrash the JVM and start garbage collecting like crazy.
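As a rough sketch of what batching can look like, the snippet below groups documents into fixed-size chunks and builds one newline-delimited bulk body per chunk, instead of one request per document. The index name `documents`, the `fulltext` field, the batch size of 500, and the helper names are all assumptions for illustration. Older releases such as 1.x also expect a `_type` in each action line.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_payloads(docs, index="documents", batch_size=500):
    """Yield one newline-delimited bulk body per batch of documents."""
    for batch in chunked(docs, batch_size):
        lines = []
        for doc in batch:
            # Action line: says which index and id the next line belongs to
            lines.append(json.dumps({"index": {"_index": index, "_id": str(doc["id"])}}))
            # Source line: the document itself
            lines.append(json.dumps({"fulltext": doc["fulltext"]}))
        # The bulk format requires a trailing newline
        yield "\n".join(lines) + "\n"


docs = [{"id": i, "fulltext": f"page {i}"} for i in range(1200)]
payloads = list(bulk_payloads(docs))
print(len(payloads), "bulk requests instead of", len(docs), "single updates")
```

Each payload would be sent as the body of a single `_bulk` request. The right batch size depends on document size and cluster capacity, so 500 is only a starting point to tune from.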

From there, on the query side, you'll want to get a good working knowledge of the different tokenization and analysis options. There are a lot of subtle and interesting combinations to be had in there that influence performance and relevance of your search results.
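To make the effect of analysis concrete, here is a toy sketch in plain Python. The two functions are simplified stand-ins written for this example, not Lucene's or Elasticsearch's actual analyzers, and the sample OCR string is made up. They show how the choice of tokenization decides whether a query matches at all.

```python
import re


def whitespace_analyzer(text):
    """Toy analyzer: split on whitespace only, keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Toy analyzer: lowercase, then split on anything not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def matches(query, document, analyzer):
    """True if every analyzed query term appears among the document's terms."""
    doc_terms = set(analyzer(document))
    return all(term in doc_terms for term in analyzer(query))


ocr_text = "NET SALES, 2013: 15,000"  # made-up sample of OCR output

print(whitespace_analyzer(ocr_text))
print(simple_analyzer(ocr_text))
print(matches("net sales", ocr_text, whitespace_analyzer))
print(matches("net sales", ocr_text, simple_analyzer))
```

With the whitespace-only version, the document terms keep their case and trailing punctuation, so a lowercase query finds nothing. The lowercasing version matches, but it also splits "15,000" into two tokens, which is the kind of trade-off worth checking against real data.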

NDizzle 4515 days ago

Do you have a demo on either of those sites where I can input terms into a search box and look at results? What explanation do you give to users as to the options available when formatting the query?

nzadrozny 4515 days ago

We've got a free Heroku addon that's pretty easy to spin up and play with. Elasticsearch also has an analyze[1] API that can be helpful to play around with.

It's also possible to download and install ES locally and run any number of front-end interfaces, some of which include query builders. ElasticHQ seems like a decent option for that. The venerable Elasticsearch-head is another.

I think now that ES 1.0 has shipped, more experimental tools will start to emerge that help people learn and interact with ES itself. (If anyone out there is a front-end whiz and wants to help me build something like that, please email nz@bonsai.io!)

1. http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

dclara 4515 days ago

May I ask what you meant about "web based search implementation using elasticsearch"?

Do you mean that you use ES to index your documents on the backend and make them searchable on the web? Or do you mean that you use ES to index documents available on the web and let people search for them?

NDizzle 4515 days ago

Sure. Your first guess is correct - I do indexing of backend documents.

I fetch a steady stream of FOIA documents, close to the maximum possible each week, and PDF/OCR them. I expose a web interface to the analysts I work with, to help them gather up documents for further analysis.

The second guess would probably be more interesting to most people.

dclara 4515 days ago

Yes, then I think ES fits your application well, and you should really take advantage of it to provide your web interface for searching those documents.

I'm more interested in the second case, but I don't think ES fits due to the huge volume of data to be indexed.

NDizzle 4515 days ago

Oh - I have one! I just want to see examples of others so I can figure out ways to improve my implementation.
