Hacker News
by nl 3531 days ago
I love the spaCy visualization tool too. You know about http://corenlp.run/ though?

So I do a lot of work with Twitter data - to the point where I have (many) custom Word2Vec models for Twitter data.

TweeboParser is good, and NLTK has a basic Twitter tokenizer built in. But I still often end up using spaCy for stuff anyway. For example, I was building a custom distance metric to explore tweet clusters, and the spaCy word vectors were fine to get that working.
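
A minimal sketch of what a distance metric like that could look like, not the actual one described above. It assumes each tweet is already tokenized, averages the word vectors of the known tokens, and takes the cosine distance between the averages. The `TOY_VECTORS` table and the function names are hypothetical stand-ins; in practice the vectors would come from a pretrained model.

```python
import math

# Hypothetical toy vectors standing in for real pretrained word vectors
# (for example, the ones a spaCy model would provide). Values are made up.
TOY_VECTORS = {
    "coffee": [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "latte": [0.85, 0.15, 0.05],
    "football": [0.0, 0.9, 0.3],
    "goal": [0.1, 0.8, 0.4],
}


def tweet_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of all tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no direction to compare; treat as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def tweet_distance(tokens_a, tokens_b, vectors=TOY_VECTORS):
    """Cosine distance between the averaged word vectors of two tweets."""
    va = tweet_vector(tokens_a, vectors)
    vb = tweet_vector(tokens_b, vectors)
    if va is None or vb is None:
        return 1.0  # assumption: tweets with no known words count as unrelated
    return cosine_distance(va, vb)


if __name__ == "__main__":
    print(tweet_distance(["coffee", "espresso"], ["latte"]))
    print(tweet_distance(["coffee", "espresso"], ["football", "goal"]))
```

With spaCy itself, `doc.vector` gives the averaged token vectors by default and `doc.similarity(other)` gives the cosine similarity, so most of this collapses to a couple of calls. Writing it out by hand mainly helps when the metric needs custom weighting or special handling of out-of-vocabulary tokens.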

It's true that I dropped down to using Gensim or Spark's Word2Vec model for some more complex models though.

1 comment

Yup, I know about that, but displaCy is just so much more beautiful.

Also, while NLTK's basic Twitter tokenizer is okay, I find that ARK's tokenizer [0] is much better. Similarly, for POS tagging of tweets, I am using the GATE POS tagger [1]. They provide a Stanford tagger model, and I can hook it up with NLTK using the StanfordTagger class. In fact, this is the kind of integration that I am missing in spaCy.

[0] https://github.com/myleott/ark-twokenize-py

[1] https://gate.ac.uk/wiki/twitter-postagger.html