by grrrr 5890 days ago
The book by Manning (freely available online) has already been recommended. I would start with this.

In addition, there is a wealth of online video lectures that may inspire you: http://www.datawrangling.com/hidden-video-courses-in-math-sc... and http://videolectures.net/mlss04_hofmann_irtm/ and http://videolectures.net/Top/Computer_Science/

As far as search engines go, it's certainly worth playing around with Lucene. It's well implemented, and you'll learn a lot about what really matters when it comes to indexing and retrieval.
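
To give a feel for what indexing and retrieval mean at the smallest scale, here is a toy inverted index in Python. It is only a sketch: it is not Lucene's API and says nothing about how Lucene is implemented. The names `build_index` and `search` are made up for illustration, and it assumes plain whitespace tokenization and AND-only queries.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query token (AND)."""
    tokens = query.lower().split()
    if not tokens:
        return []
    # .get avoids creating empty entries in the defaultdict for unknown terms
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, purely for illustration
docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick brown dogs",
]
index = build_index(docs)
print(search(index, "quick brown"))
print(search(index, "the dog"))
```

A real engine adds a lot on top of this (analysis, scoring, compression, on-disk segments), which is the part you pick up by actually playing with Lucene.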

For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/

1 comment

It is pretty nice that the different implementations of Lucene all use the same index file formats.

There are some pretty nice tools to go with Lucene - I've used Luke quite a bit: http://code.google.com/p/luke/