by tdj 4643 days ago
Actually, I think you could save yourself some trouble and use scikit-learn's built-in text preprocessing utils:

Word counter: http://scikit-learn.org/stable/modules/generated/sklearn.fea...

Hashing vectorizer if you want to trade off explainability for speed and scalability: http://scikit-learn.org/stable/modules/generated/sklearn.fea...

TF-IDF weighting: http://scikit-learn.org/stable/modules/generated/sklearn.fea...
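
For a rough idea of how the three fit together, here is a minimal sketch. The links above are truncated, so I'm assuming they refer to CountVectorizer, HashingVectorizer and TfidfTransformer from sklearn.feature_extraction.text; the toy documents and the n_features value are made up for illustration.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
)

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Word counter: learns a vocabulary and returns a sparse count matrix.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = count_vec.vocabulary_  # maps each term to its column index

# Hashing vectorizer: no vocabulary is stored, so it is stateless and
# scales well, but you cannot map a column back to a word.
# n_features=2**10 is an arbitrary choice for this toy corpus.
hash_vec = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
hashed = hash_vec.transform(docs)

# TF-IDF weighting applied on top of the raw counts.
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape, hashed.shape, tfidf.shape)
```

All three outputs stay as scipy sparse matrices, which is what most scikit-learn estimators expect for text features anyway.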

Also, if you transform bag-of-words vectors into a dense form, you're gonna have a bad time (insert appropriate meme picture here). In large corpora, the vocabulary, and with it the dimensionality, grows quickly: if you work with news corpora or Wikipedia, you're in the 100k-1M dimensional space pretty quickly.
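
To put rough numbers on that, here is a back-of-the-envelope sketch. The corpus size (10,000 documents), the 100k-term vocabulary and the ~100 distinct terms per document are all assumptions picked for illustration, not measurements from a real corpus.

```python
import numpy as np
from scipy import sparse

# Assumed sizes, for illustration only.
n_docs, n_terms = 10_000, 100_000
terms_per_doc = 100

# Build a synthetic sparse "bag-of-words" matrix with random term columns.
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(n_docs), terms_per_doc)
cols = rng.integers(0, n_terms, size=n_docs * terms_per_doc)
data = np.ones(rows.size, dtype=np.int64)
X = sparse.csr_matrix((data, (rows, cols)), shape=(n_docs, n_terms))

# Memory actually held by the CSR representation.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# Memory a dense array of the same shape and dtype would need
# (computed arithmetically; we never call X.toarray()).
dense_bytes = n_docs * n_terms * X.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"dense:  {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense version needs 8 GB for int64 counts, while the sparse one stays in the low tens of megabytes, and the gap only widens as the vocabulary grows.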

Great to see an approachable explanation for NLP. As they say sometimes, when you know how it's done, it stops being "Artificial Intelligence".