by popra
3694 days ago
As someone interested in the subject but with 0 insight or knowledge, would these algorithms be a good match for short text clustering? For example, identifying identical products in a price comparison app based on their similar but not identical title/name and possibly other attributes.
There are two types of features, each with its own clustering algorithms: categorical and numerical. Numerical features are the most commonly supported by out-of-the-box clustering tools (meaning most of the algorithms made easily available to you will expect numerical features). Categorical features can be reduced to binary numerical features, though whether that’ll make sense depends on your data. For example, if products have a set of categories they can be sorted into, you may want to just use that data along with K-Modes clustering or some other approach that considers categorical features. (Though to me those don’t really seem to produce useful results.)
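To make “reduced to binary numerical features” concrete, here is a minimal sketch of one-hot encoding in plain Python. The product records and the `category` field are made up for illustration, and `one_hot` is a hypothetical helper, not part of any library.

```python
def one_hot(records, field):
    """Expand one categorical field into binary (0/1) columns, one per distinct value."""
    values = sorted({r[field] for r in records})
    rows = [[1 if r[field] == v else 0 for v in values] for r in records]
    return values, rows

# Hypothetical product records; the "category" field is an assumption.
products = [
    {"title": "USB-C cable 1m", "category": "cables"},
    {"title": "Laptop sleeve 13in", "category": "bags"},
    {"title": "HDMI cable 2m", "category": "cables"},
]

columns, rows = one_hot(products, "category")
for product, row in zip(products, rows):
    print(product["title"], dict(zip(columns, row)))
```

Each product ends up with one 0/1 column per category, which is the shape most numerical clustering tools expect.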
For short text clustering I’d try an n-gram frequency approach, wherein you reduce the text to a set of features describing the frequency of n-grams from, say, unigrams up to trigrams. You’re attempting to balance the number of features you end up needing to process against the amount of locality information you need to get a useful result. If you end up with far too many features, you could attempt to reduce them with an approach such as PCA, but it may not even be necessary.
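A rough sketch of what that could look like, using only the standard library. The tokenization rule, the example titles, and the choice of cosine similarity to compare two titles are all my own assumptions here, not a recommendation of any particular tool.

```python
from collections import Counter
import math
import re

def ngram_counts(text, max_n=3):
    """Count word n-grams from unigrams up to max_n-grams (trigrams by default)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical product titles, made up for illustration.
titles = [
    "Apple iPhone 13 128GB Blue",
    "iPhone 13 (128 GB) blue Apple",
    "Samsung Galaxy S21 Ultra",
]
vectors = [ngram_counts(t) for t in titles]
print(cosine(vectors[0], vectors[1]))  # two listings that share several n-grams
print(cosine(vectors[0], vectors[2]))  # listings with no n-grams in common
```

Raising `max_n` adds more locality information but also many more features, which is the trade-off described above.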
Do note that I’m only a little above 0 in terms of insight or knowledge, so take my advice with a good heap of salt. Clustering of this data would group together products that are described “similarly”, but “similarly” in this case doesn’t imply the products are in fact similar, but only that they’re described in a similar manner.
You may also want to explore graph clustering over vector clustering. There may be no out-of-the-box solutions readily available, but it is likely a much better fit for text in general.
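As a very simple illustration of the graph idea (and only one of many ways to do it), you could link any two titles whose word sets overlap enough and treat each connected component as a cluster. The Jaccard measure, the 0.5 threshold, and the example titles below are assumptions picked for the sketch.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of words two titles have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def graph_clusters(titles, threshold=0.5):
    """Connect titles whose similarity meets the threshold; return connected components."""
    sets = [tokens(t) for t in titles]
    parent = list(range(len(titles)))  # union-find forest over title indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)  # add an edge by merging components

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical product titles, made up for illustration.
titles = [
    "Logitech M185 wireless mouse",
    "Wireless Mouse Logitech M185 grey",
    "Anker USB-C charger 30W",
    "Anker 30W USB-C Charger",
]
print(graph_clusters(titles))
```

One thing to watch with this approach is chaining: A links to B and B links to C, so A and C land in the same cluster even if they look nothing alike, which makes the threshold matter a lot.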