Hacker News
by butterm 3532 days ago
What I love about spaCy is its dependency parsing visualization tool [0]. It's so much better than what Stanford offers.

Other than that, I find spaCy's philosophy of "one (best) way of doing everything" a bit stifling. I don't think there is a "best" parser or a "best" named entity recognizer. A parser may perform very well in one domain (for example, TweeboParser [1] performs well on tweets) and very badly in another. This is true for almost everything in NLP, and NLTK embraces this diversity quite well. This is why NLTK is my go-to tool when I want to do something cutting-edge in NLP.

[0] https://demos.explosion.ai/displacy/ [1] https://github.com/ikekonglp/TweeboParser

2 comments

I definitely agree that the same weights won't be optimal for different domains. If you need to parse tweets, you should have a tweet-trained model. The tweet model probably shouldn't be thinking about Jane Austen novels. We want to open a model store where you can buy language- and domain-specific models.

I think 99% of the time there's one best algorithm, and even one best implementation of it. It's the weights, and sometimes the features, that need to vary.
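To make the "same algorithm, different weights" point concrete, here is a minimal sketch. It is not spaCy's implementation: the `predict` function, the feature names, and both weight tables are hypothetical and chosen only for illustration. The scoring code is shared, and only the weights change between domains.

```python
def predict(features, weights):
    """Score each label as the sum of its feature weights; return the best label."""
    scores = {}
    for label, label_weights in weights.items():
        scores[label] = sum(label_weights.get(f, 0.0) for f in features)
    return max(scores, key=scores.get)


# Hypothetical weights "trained" on news text: the word "lol" was never seen,
# so only the generic lowercase feature carries any weight.
NEWS_WEIGHTS = {
    "NOUN": {"is_lowercase": 0.2},
    "INTJ": {"is_lowercase": 0.0},
}

# Hypothetical weights "trained" on tweets: "lol" is a strong interjection cue.
TWEET_WEIGHTS = {
    "NOUN": {"is_lowercase": 0.2},
    "INTJ": {"word=lol": 1.5},
}

features = ["word=lol", "is_lowercase"]
news_tag = predict(features, NEWS_WEIGHTS)    # uses the news weights
tweet_tag = predict(features, TWEET_WEIGHTS)  # same code, tweet weights
```

With these toy numbers the news weights tag "lol" as a noun and the tweet weights tag it as an interjection, even though `predict` itself never changes.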

Finally, I love displaCy too. Ines does great work :). Have you seen that we open-sourced it recently? It's now very easy to run locally and connect to the model you're developing. You can use it with any other parser, too. https://explosion.ai/blog/displacy-js-nlp-visualizer

I am so glad that you guys open-sourced displaCy. I would love to give it a spin on my system. Kudos for all the great work you are doing!

I love the spaCy visualization tool too. You know about http://corenlp.run/ though?

So I do a lot of work with Twitter data - to the point where I have (many) custom Word2Vec models for Twitter data.

TweeboParser is good, and NLTK has a basic Twitter tokenizer built in. But I still often end up using spaCy for stuff anyway. For example, I was building a custom distance metric to explore tweet clusters, and the spaCy word vectors were fine to get that working.
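As a rough illustration of that kind of metric (not my actual code), here is a sketch that averages word vectors per tweet and takes the cosine distance. The `TOY_VECTORS` table, the function names, and the whitespace tokenization are all assumptions for the example; in practice the vectors would come from a trained model.

```python
import math

# Hypothetical 3-dimensional vectors standing in for real word embeddings.
TOY_VECTORS = {
    "coffee": [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "latte": [0.85, 0.15, 0.05],
    "football": [0.0, 0.9, 0.3],
    "goal": [0.1, 0.8, 0.4],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def tweet_distance(tweet_a, tweet_b, vectors=TOY_VECTORS):
    """Distance between two tweets using naive lowercase whitespace tokens."""
    vec_a = mean_vector(tweet_a.lower().split(), vectors)
    vec_b = mean_vector(tweet_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return 1.0  # assumption: tweets with no known words count as unrelated
    return cosine_distance(vec_a, vec_b)
```

A real pipeline would swap the whitespace split for a Twitter-aware tokenizer and feed the resulting distances into a clustering step.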

It's true that I dropped down to using Gensim or Spark's Word2Vec model for some more complex models though.

Yup, I know about that, but displaCy is just so much more beautiful.

Also, while NLTK's basic Twitter tokenizer is okay, I find that ARK's tokenizer [0] is much better. Similarly, for POS tagging of tweets, I am using the GATE POS tagger [1]. GATE provides a model for the Stanford tagger, which I can hook up to NLTK using the StanfordTagger class. In fact, this is the kind of integration that I miss in spaCy.

[0] https://github.com/myleott/ark-twokenize-py [1] https://gate.ac.uk/wiki/twitter-postagger.html