by gioark 3292 days ago
hi, I am the author of the article.

@devhead: To import the data into ES we used a custom application to extract the text from the OCR'd documents. This is required to support our bookreader software. A complete ingestion takes a few days; we rate-limit indexing in order not to overload the cluster, and maintain reasonable search performance.
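To illustrate the rate-limiting idea, here is a minimal sketch of throttled batch indexing. It is not the actual ingestion application: `throttled_index`, `send_batch`, the batch size, and the documents-per-second cap are all hypothetical names and values, and `send_batch` stands in for whatever call actually ships a batch to the cluster.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def throttled_index(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: ships one batch to the cluster
    batch_size: int = 500,                     # assumed value, tune for your cluster
    max_docs_per_sec: float = 1000.0,          # assumed value, tune for your cluster
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send batches while keeping the average rate at or below the cap."""
    sent = 0
    start = clock()
    for batch in batched(docs, batch_size):
        send_batch(batch)
        sent += len(batch)
        # Earliest moment at which `sent` documents are allowed under the cap.
        earliest = start + sent / max_docs_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return sent
```

The throttle paces against the total sent so far rather than sleeping a fixed interval, so time spent inside `send_batch` counts toward the budget and a slow cluster is not slowed further.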

5 comments

Hey, I'm wondering why you didn't consider using stopwords to prevent bloated inverted index entries for words like 'the'?
We don't use stopwords because we want to find all the best and complete matches. We don't want to ignore any of the words in the search query.
You do use stopwords. Your most common unigrams are not in the index, by design. You just use your own stopwords.
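For readers unfamiliar with the trade-off being debated here, a toy sketch of how stopwords shrink an inverted index. The documents and the stopword list are made up for illustration, and this is not how the archive's index is built.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


def build_index(
    docs: Dict[int, str],
    stopwords: FrozenSet[str] = frozenset(),
) -> Dict[str, Set[int]]:
    """Map each token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stopwords:
                index[token].add(doc_id)
    return dict(index)


# Made-up sample documents.
docs = {
    1: "the history of the archive",
    2: "the book reader",
    3: "search the archive",
}

full = build_index(docs)
pruned = build_index(docs, frozenset({"the", "of"}))

# 'the' appears in every document, so its posting list is as long as the corpus.
print(sorted(full["the"]))
# With stopwords the entry is gone, and a query for 'the' can no longer match.
print("the" in pruned)
```

The posting list for a very common word grows with the corpus, which is the bloat the question refers to; dropping it saves space at the cost of exact matches on that word.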
Great article, man. I'm always super excited when I find articles that talk about information retrieval systems. I have a lot of questions for you. I've been working on a search engine project, www.cognifly.com, for a year and its inverted index is still very small, about 4 GB now. So is it OK if I send you an email?
sure, feel free to contact me by email or DM on twitter. gio archive org
It was a great article. On a side note, holy shit is it rare to even hear of other guys named Giovanni, much less as similar a last name (I'm Giovanni d'Amelio).
Wow, those ASCII tables look terrible on the iPhone. If I rotate the phone they clean up, but in portrait they are unreadable.
cool, thanks for sharing; maybe one day you can release your ingestion app to the world.