by teorema
1722 days ago
This is consistent with my own experiences with topic models, although I'm left wondering to what extent these observations generalize, and why. I tried to find more details in previous posts about the models used, etc., but couldn't find much. There's a lot of interest in overfitting in ML, but it tends to focus on supervised methods; I think unsupervised methods need more attention, with regard to overfitting in particular but also more broadly.
|
Overfitting is certainly part of the problem - in one of my earlier posts I talk about "conceptually spurious words," which are essentially the product of overfitting - but the more difficult problem is polysemy. I'm sure there are ways to mitigate that - expanding the feature space with POS tagging, etc. - but ultimately I think the solution is to simply avoid using a dimensionality reduction method for text classification. Supervised models are clearly the way to go - even if those "models" are just keyword dictionaries curated based on domain knowledge.
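
To make the "keyword dictionaries" idea concrete, here is a minimal sketch of what such a classifier could look like. The topic names, word lists, and the `classify` helper are hypothetical and chosen only for illustration; they are not taken from the posts being discussed.

```python
import re

# Hypothetical hand-curated dictionaries (illustrative topics and words only).
KEYWORDS = {
    "economy": {"jobs", "taxes", "inflation", "wages"},
    "health": {"hospital", "vaccine", "insurance", "doctor"},
}


def classify(text, keywords=KEYWORDS, min_hits=1):
    """Return the topics whose dictionary matches at least `min_hits` distinct tokens."""
    # Lowercase and split into simple word tokens; no stemming or POS tagging.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        topic
        for topic, words in keywords.items()
        if len(tokens & words) >= min_hits
    }


print(classify("Inflation is eating into wages"))
print(classify("Nice weather today"))
```

Exact-match dictionaries like this do not resolve polysemy on their own, but they do let whoever curates the list leave out words known to be ambiguous in the domain, and raising `min_hits` is one rough way to require more evidence before assigning a label.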