Hacker News
by bcaine 4076 days ago
Nice read. I did something similar with the same dataset about a year ago: I compared LDA (Latent Dirichlet Allocation) to TF-IDF as tools for finding similar beers based on their review text. LDA turned up lots of intuitive and funny topics.
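
For anyone curious what the TF-IDF side of a comparison like that might look like, here is a minimal stdlib-only sketch. It is not the code from that project. The beer names and review snippets are made up for illustration, and the helpers (`tfidf_vectors`, `cosine`, `most_similar`) are hypothetical names, not part of any library. It weights each term by term frequency times log inverse document frequency, then ranks beers by cosine similarity.

```python
import math
from collections import Counter

# Made-up review snippets keyed by hypothetical beer names (not from the dataset).
reviews = {
    "stout_a": "roasted coffee chocolate dark malt creamy",
    "stout_b": "dark chocolate roasted malt coffee bitter",
    "ipa_a": "citrus hops pine bitter grapefruit crisp",
}


def tfidf_vectors(docs):
    """Return {name: {term: tf-idf weight}} for whitespace-tokenized docs."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return the other beer whose review vector is closest to `name`."""
    return max(
        (other for other in vectors if other != name),
        key=lambda other: cosine(vectors[name], vectors[other]),
    )


vectors = tfidf_vectors(reviews)
print(most_similar("stout_a", vectors))  # prints "stout_b"
```

On real review text you would want proper tokenization and stop-word removal. The weakness of this approach is that two beers only match when their reviews share exact words, which is the gap a topic model tries to close.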

I suggest you play with LDA; it seemed to work really well at generating topics. There is also a lot of fascinating, very readable research using it. Check out SNAP's work on the same dataset [1] and some of the Yelp Dataset Challenge winners [2]. If you decide to try it, Gensim [3] was pleasant enough to work with.

[1] http://snap.stanford.edu/data/web-BeerAdvocate.html

[2] http://www.yelp.com/dataset_challenge

[3] https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-a...