by SeppoErviala 4569 days ago
Check out gensim if you want to do topic modeling or similarity comparisons in Python.

http://radimrehurek.com/gensim/

It has good implementations of various algorithms, some of which support streaming or distributed computation, and it can load and dump data in various formats.

I've used it to build a content-based recommender using tf-idf, LSI and a similarity index. Once the index is built, queries against it are really fast. It can handle quite large corpora with little memory.
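To make the "build the index once, then query it" idea concrete, here is a minimal stdlib-only sketch. It is not gensim's API and it skips the LSI step entirely: `TfidfIndex`, `tokenize` and the sample documents are made up for illustration, and the smoothed idf formula is just one common choice. The point is only that the per-document work happens up front, so a query reduces to a few sparse dot products.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class TfidfIndex:
    """Hypothetical helper: precomputes unit-length tf-idf vectors once."""

    def __init__(self, documents):
        tokenized = [tokenize(doc) for doc in documents]
        n_docs = len(tokenized)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for tokens in tokenized for term in set(tokens))
        # Smoothed idf (an assumption, one of several common variants).
        self.idf = {
            term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in df.items()
        }
        # All per-document work is done here, at build time.
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]

    def _vectorize(self, tokens):
        # Terms never seen at build time are dropped.
        counts = Counter(t for t in tokens if t in self.idf)
        weights = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            return {}
        return {t: w / norm for t, w in weights.items()}

    def query(self, text, top_n=3):
        """Return (doc_id, cosine similarity) pairs, best match first."""
        q = self._vectorize(tokenize(text))
        scores = []
        for doc_id, vec in enumerate(self.vectors):
            # Vectors are unit length, so the dot product is the cosine.
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            scores.append((doc_id, score))
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top_n]


documents = [
    "red sports car on a mountain road",
    "blue sports car parked in the city",
    "recipe for tomato soup with basil",
]
index = TfidfIndex(documents)
results = index.query("fast red car")
```

With gensim itself you would swap the hand-rolled pieces for its dictionary, model and similarity classes, and put an LSI model between the tf-idf step and the index.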

2 comments

Second this, I'm surprised you don't read more about it here. We use it in production to recommend image search terms based on unstructured text, and it performs better with a few lines of Python code than anything our team could write in a lower-level language in months. It's REALLY fast once you've built an index.

The reason for that is a pretty epic list of dependencies (have fun explaining why the prod boxes need a Fortran compiler), but in terms of efficiency and speed of development it's an obvious choice.

:-)

Hopefully the SciPy & BLAS dependencies will only get easier to install from now on... Continuum Analytics received shit loads of money and some of it is going towards better scientific Python packaging, I believe.

gensim is awesome: it abstracts very complex algorithms into extremely simple function calls. The models.HdpModel class is very powerful.