by karanlyons
3696 days ago
The answer is maybe. You need to reduce that text to a set of features that you can pipe into a clustering algorithm, and what’ll really make or break your approach is how you convert the text.

There are two types of features, each with its own family of clustering algorithms: categorical and numerical. Numerical features are the most commonly supported by out of the box clustering tools (meaning most of the algorithms made easily available to you will expect numerical features). Categorical features can be reduced to binary numerical features (one 0/1 indicator per category), though whether that makes sense depends on your data. For example, if products have a set of categories they can be sorted into, you may want to just use that data along with K-Modes clustering or some other approach that considers categorical features. (Though to me those don’t really seem to produce useful results.)

For short text clustering I’d try an ngram frequency approach, wherein you reduce the text to a set of features describing the frequency of ngrams from, say, unigram up to trigram. You’re attempting to balance the number of features you end up needing to process against the amount of locality information (how much of the word order you preserve) you need to get a useful result. If you end up with far too many features, you could attempt to cull them with an approach such as PCA, but it may not even be necessary.

Do note that I’m only a little above 0 in terms of insight or knowledge, so take my advice with a good heap of salt. Clustering this data would group together products that are described “similarly”, but “similarly” here doesn’t imply the products are in fact similar, only that they’re described in a similar manner.

You may also want to explore graph clustering over vector clustering, which is likely a much better fit for text in general, though there may be no out of the box solutions readily available.
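To make the ngram frequency idea a bit more concrete, here’s a rough sketch in plain Python. It’s an illustration under my own assumptions, not a recommended implementation: the function names (`ngram_counts`, `vectorize`, `cosine`) and the product descriptions are made up, tokenization is a naive whitespace split, and cosine similarity is just one of several ways you might compare the resulting vectors.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, max_n=3):
    """Count word n-grams from unigrams up to max_n-grams (naive whitespace tokens)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def vectorize(texts, max_n=3):
    """Turn each text into a numeric count vector over a shared n-gram vocabulary."""
    per_text = [ngram_counts(t, max_n) for t in texts]
    vocab = sorted(set().union(*per_text))
    # Counter returns 0 for n-grams a text does not contain.
    vectors = [[counts[gram] for gram in vocab] for counts in per_text]
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical product descriptions, invented purely for illustration.
descriptions = [
    "red cotton t-shirt",
    "blue cotton t-shirt",
    "stainless steel water bottle",
]

vocab, vectors = vectorize(descriptions)
shirt_vs_shirt = cosine(vectors[0], vectors[1])
shirt_vs_bottle = cosine(vectors[0], vectors[2])
print(len(vocab), shirt_vs_shirt, shirt_vs_bottle)
```

The two shirt descriptions share ngrams and so end up closer to each other than to the bottle, which is exactly the “described similarly” caveat: the vectors only know about wording. Vectors like these are the numerical features a clustering tool would expect. If you’d rather not roll your own, scikit-learn’s `CountVectorizer` with `ngram_range=(1, 3)` produces the same kind of features.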