As far as search engines go, it's certainly worth playing around with Lucene. It's well implemented, and you'll learn a lot about what really matters in indexing and retrieval.
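If it helps to see the core idea before diving in, here is a toy inverted index in plain Python. To be clear, this is not Lucene's API or how Lucene is actually implemented; the function names (tokenize, build_index, search) and the sample documents are made up for illustration, and the "analysis" step is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def tokenize(text):
    # Naive analysis (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Inverted index: map each term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND query: return ids of documents that contain every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Mahout is a machine learning library",
    3: "Luke inspects a Lucene index",
}
index = build_index(docs)
print(sorted(search(index, "lucene")))
print(sorted(search(index, "library lucene")))
```

A real engine layers a lot on top of this (proper analyzers, scoring, compact on-disk postings), which is roughly the part you pick up by working with Lucene itself.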
For the text processing side (classification, data extraction), it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/
There are some pretty nice tools to go with Lucene - I've used Luke quite a bit: http://code.google.com/p/luke/