


	by gioark 3339 days ago
	hi, I am the author of the article. @devhead: To import the data into ES we used a custom application to extract the text from the OCR'd documents. This is required to support our bookreader software. A complete ingestion takes a few days; we rate-limit indexing in order not to overload the cluster, and maintain reasonable search performance.
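The rate-limited indexing described above can be sketched roughly as follows. This is only a minimal illustration, not the author's ingestion application: the function name `throttled_batches`, the batch size, and the pacing interval are all assumptions, and the step that actually sends a batch to the cluster is left to the caller.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def throttled_batches(
    docs: Iterable[dict],
    batch_size: int,
    min_interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[dict]]:
    """Yield batches of at most batch_size documents, pausing so that
    consecutive batches are released at least min_interval_s apart.

    Hypothetical helper: the names and parameters are illustrative only.
    """
    it = iter(docs)
    last_start = None
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        if last_start is not None:
            # Wait out whatever is left of the pacing interval.
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        # The caller would send this batch to the search cluster here.
        yield batch
```

Pacing by batch rather than by document keeps the throttle cheap, and the interval can be raised whenever search latency on the cluster starts to suffer.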

5 comments

mdellabitta 3339 days ago

Hey, I'm wondering why you didn't consider using stopwords to prevent bloated inverted index entries for words like 'the'?


gioark 3339 days ago

We don't use stopwords because we want to find the best and most complete matches. We don't want to ignore any of the words that are part of the search query.


aisofteng 3339 days ago

You do use stopwords. Your most common unigrams are not in the index, by design. You just use your own stopwords.
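To make the point concrete, here is a toy sketch of how leaving the most common unigrams out of an inverted index has the same effect as a stopword list. It is not the article's actual pipeline: the three-document corpus, the whitespace tokenizer, and the helper names `build_index` and `most_common_terms` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def build_index(docs, stopwords=frozenset()):
    """Map each term to the set of document ids that contain it,
    skipping any term listed in stopwords."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            if tok not in stopwords:
                index[tok].add(doc_id)
    return dict(index)


def most_common_terms(docs, k):
    """Return the k terms that appear in the most documents."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return frozenset(term for term, _ in doc_freq.most_common(k))


docs = {1: "the cat sat", 2: "the dog ran", 3: "the cat ran"}

full = build_index(docs)
# 'the' appears in every document, so its posting list is the largest.

derived_stopwords = most_common_terms(docs, 1)
pruned = build_index(docs, derived_stopwords)
# 'the' no longer has a posting list; the other terms are unchanged.
```

Whether the excluded terms come from a fixed list or from corpus frequencies, the index ends up without posting lists for them; the difference lies in how the list is chosen and how queries containing those words are handled.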


kampsy 3339 days ago

Great article, man. I'm always super excited when I find articles that talk about information retrieval systems. I have a lot of questions for you. I've been working on a search engine project, www.cognifly.com, for a year, and its inverted index is still very small, about 4 GB now. So is it OK if I send you an email?


gioark 3339 days ago

Sure, feel free to contact me by email or DM on Twitter. gio archive org


giodamelio 3339 days ago

It was a great article. On a side note, holy shit is it rare to even hear of other guys named Giovanni, much less one with such a similar last name (I'm Giovanni d'Amelio).


bognition 3339 days ago

Wow, those ASCII tables look terrible on the iPhone. If I rotate to landscape they clean up, but in portrait they are unreadable.


devhead 3339 days ago

cool, thanks for sharing; maybe one day you can release your ingestion app to the world.
