Hacker News
by popra 3694 days ago
As someone interested in the subject but with 0 insight or knowledge: would these algorithms be a good match for short text clustering? For example, identifying identical products in a price comparison app based on their similar but not identical titles/names and possibly other attributes.
3 comments

The answer is maybe. You need to reduce that text to a set of features that you can pipe into a clustering algorithm, and what’ll really make or break your approach is how you convert the text.

There are two types of features, each with its own family of clustering algorithms: categorical and numerical. Numerical features are the ones most commonly supported by out-of-the-box clustering tools (meaning most of the algorithms made easily available to you will expect numerical features). Categorical features can be reduced to binary numerical features, though whether that makes sense depends on your data. For example, if products have a set of categories they can be sorted into, you may want to use that data directly with K-Modes clustering or some other approach that handles categorical features. (Though to me those don’t really seem to produce useful results.)
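To make "reduced to binary numerical features" concrete, here is a minimal sketch of one-hot encoding in plain Python. The product records and the `category` field are made up for illustration; a real pipeline would more likely use a library encoder.

```python
def one_hot_encode(records, field):
    """Turn one categorical field into binary (0/1) numerical features."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted({record[field] for record in records})
    rows = [
        [1 if record[field] == category else 0 for category in categories]
        for record in records
    ]
    return categories, rows


# Hypothetical product records; the field names are assumptions.
products = [
    {"title": "Acme Phone X 64GB", "category": "phones"},
    {"title": "Acme Laptop 13", "category": "laptops"},
    {"title": "AcmePhone X (64 GB)", "category": "phones"},
]

categories, rows = one_hot_encode(products, "category")
print(categories)  # column labels, one per distinct category
print(rows)        # one binary row per product
```

Each product becomes a row with a single 1 in the column for its category, which is a form that numerical clustering tools can accept.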

For short text clustering I’d try an n-gram frequency approach, wherein you reduce the text to a set of features describing the frequency of n-grams from, say, unigrams up to trigrams. You’re attempting to balance the number of features you end up needing to process against the amount of locality information you need to get a useful result. If you end up with far too many features, you could attempt to reduce them with an approach such as PCA, but it may not even be necessary.
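As a rough illustration of the n-gram idea, the sketch below counts word n-grams from unigrams up to trigrams and compares two titles by cosine similarity. The tokenizer, the sample titles, and the choice of cosine similarity are all assumptions made for the example, not something prescribed above.

```python
import math
import re
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count word n-grams from unigrams up to max_n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up product titles for illustration.
title_a = "Acme Phone X 64GB Black"
title_b = "Acme Phone X 64 GB black edition"
title_c = "Zeta Blender 500W"

similar = cosine_similarity(ngram_counts(title_a), ngram_counts(title_b))
different = cosine_similarity(ngram_counts(title_a), ngram_counts(title_c))
print(similar, different)
```

Note that "64GB" and "64 GB" tokenize differently here, which is exactly the kind of detail in the text conversion step that can make or break the result.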

Do note that I’m only a little above 0 in terms of insight or knowledge, so take my advice with a good heap of salt. Clustering of this data would group together products that are described “similarly”, but “similarly” in this case doesn’t imply the products are in fact similar, but only that they’re described in a similar manner.

You may also want to explore graph clustering over vector clustering, which—while there may be no out of the box solutions readily available—is likely a much better fit for text in general.
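One very simple reading of "graph clustering" is sketched below: connect two titles when their token overlap passes a threshold, then treat each connected component as a cluster. The Jaccard measure, the 0.5 threshold, and the sample titles are assumptions for illustration; real graph clustering methods are considerably more sophisticated.

```python
import re
from itertools import combinations


def token_set(title):
    """Lowercased alphanumeric tokens of a title, as a set."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity_graph(titles, threshold=0.5):
    """Link similar titles, then return connected components as clusters."""
    sets = [token_set(title) for title in titles]
    neighbours = {i: set() for i in range(len(titles))}
    for i, j in combinations(range(len(titles)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    seen = set()
    clusters = []
    for start in range(len(titles)):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this title.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Made-up product titles for illustration.
titles = [
    "Acme Phone X 64GB Black",
    "Acme Phone X 64GB - Black Edition",
    "Zeta Blender 500W",
    "Zeta 500W Blender, White",
]
clusters = cluster_by_similarity_graph(titles)
print(clusters)  # lists of indices into `titles`
```

Unlike most vector clustering tools, this needs no fixed number of clusters up front, though the threshold takes over as the knob you have to tune.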

Oh wow, I still find the level of effort the HN crowd puts into being helpful amazing. You, sir, rock!
If you're interested in clustering text documents, a canonical choice would be latent Dirichlet allocation, which is a topic modeling algorithm. You can find latent Dirichlet allocation in sklearn. However, it sounds like you're looking more for something that returns a raw similarity score, in which case it might be interesting to check out word2vec. Perhaps check out this Stack Overflow answer: https://stackoverflow.com/questions/22129943/how-to-calculat...
Thank you very much, I'll look into those.
In addition to the classic n-gram approaches everyone is suggesting, it is now rather easy to use a word2vec (google it) representation of the words instead: obtain a mapping between words and arrays of x numbers (either by finding a pretrained word2vec model on the internet or by training one on texts from your special domain), and then just run clustering on those numbers instead.
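A product title contains several words, so one common way to get a single array per title is to average the word vectors. The sketch below assumes that approach and uses a tiny made-up lookup table in place of a real word2vec model; the vectors, the names, and the cosine comparison are all hypothetical.

```python
import math

# Made-up 3-dimensional vectors standing in for a real word2vec model.
TOY_VECTORS = {
    "phone": [0.9, 0.1, 0.0],
    "smartphone": [0.8, 0.2, 0.0],
    "blender": [0.0, 0.1, 0.9],
    "mixer": [0.1, 0.2, 0.8],
    "black": [0.3, 0.9, 0.1],
}


def title_vector(title, vectors):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[word] for word in title.lower().split() if word in vectors]
    if not known:
        return None  # no word of the title is in the vocabulary
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


phone_a = title_vector("Black phone", TOY_VECTORS)
phone_b = title_vector("Black smartphone", TOY_VECTORS)
blender = title_vector("Blender", TOY_VECTORS)

print(cosine(phone_a, phone_b))  # titles with related words
print(cosine(phone_a, blender))  # unrelated titles
```

The resulting per-title vectors are plain numerical features, so they can be passed to whichever numerical clustering algorithm you prefer.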