
by artembugara 1657 days ago

We've been using spaCy a lot for the past few months.

Mostly for non-production use cases; however, I can say that it is the most robust NLP framework at the moment.

V3 added support for transformers: that's a killer feature as many models from https://huggingface.co/docs/transformers/index work great out of the box.

At the same time, I found the NER models provided by spaCy to have low accuracy on real data: we deal with news articles https://demo.newscatcherapi.com/

Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.

Btw, we recently wrote a blog post comparing spaCy to NLTK for a text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...

7 comments

artembugara 1657 days ago

Also I have an article about spaCy NER: https://newscatcherapi.com/blog/named-entity-recognition-wit...

The conclusion I came up with:

"A few notes on my Spacy NER accuracy with "real world" data

1. Low accuracy with sentences without a proper casing

2. Low accuracy overall, even with a large model

3. You'd need to fine-tune your model if you want to use it in production

4. Overall, there's no open-source high accuracy NER model that you can use out-of-a-box"

Vetch 1657 days ago

> Overall, there's no open-source high accuracy NER model that you can use out-of-a-box

Part of it is that most people underestimate the complexity of NER, and the rest of it, in my opinion, is that NER is not well-defined as a classification problem.

At least in my experience, having a specific battery of questions to query documents, first by transformer-based semantic search and then narrowed by Q/A models, removed the need for explicit NER, entity linking or relation extraction. For the case of entities as features for rule systems, shallow models, combined with using all label predictions instead of just selecting the argmax, have been sufficiently robust. Using big transformers for classification doesn't pay enough to be worth it there.
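A minimal sketch of the "all label predictions instead of argmax" idea, not the commenter's actual system. The label names, the scores, and the 0.3 threshold are made-up assumptions for illustration; the point is only that a rule can react to probability mass on a label even when that label is not the top prediction.

```python
from typing import Dict


def argmax_label(label_probs: Dict[str, float]) -> str:
    """Return only the single most likely label (what a hard decision keeps)."""
    return max(label_probs, key=label_probs.get)


def org_rule_fires(label_probs: Dict[str, float], threshold: float = 0.3) -> bool:
    """Hypothetical rule: fire if ORG carries enough probability mass,
    even when ORG is not the top-scoring label."""
    return label_probs.get("ORG", 0.0) >= threshold


# Made-up scores from an assumed shallow classifier for the span "Apple".
scores = {"ORG": 0.41, "PRODUCT": 0.45, "PERSON": 0.14}

hard_label = argmax_label(scores)      # keeps only the top label
rule_result = org_rule_fires(scores)   # looks at the full distribution
```

With these scores the argmax alone would hide the strong ORG signal, while the rule still sees it.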

wyldfire 1657 days ago

I assume your product does some kind of entity disambiguation and/or link to an ontology? Spacy doesn't provide this out of the box either, AFAICT. Can you share more info about how you do it?

artembugara 1657 days ago

We don't provide entity disambiguation out of the box. It's more of an on-request feature for Enterprise clients.

But overall, entity disambiguation is one of the most useful and difficult tasks in NLP.

SpaCy supports entity linking via a knowledge base: https://spacy.io/api/entitylinker

nefitty 1657 days ago

That might be the killer feature from what I've heard.

Tarq0n 1657 days ago

NER good enough to anonymise free text would be the absolute dream for many governments.

pantsforbirds 1657 days ago

We use spaCy at work for (mostly) news articles as well. We've been pretty impressed with it overall for detecting larger trends using the NER models. I've been contemplating whether it might be useful to make a spaCy module that uses a Count-Min Sketch to track the top N of each of the NER categories, partitioned by day (or week, etc.).

Think it could be an interesting use case to get sort of similar results to Google's search trends.
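A rough sketch of what such a tracker could look like, written in plain Python rather than as a spaCy component. The class names, the sketch dimensions, and the candidate-eviction strategy are all assumptions for illustration; the entity strings and labels would come from whatever NER model is in use.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One salted hash per row, so rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][col] for row, col in self._cells(key))


class DailyEntityTrends:
    """Keeps one sketch plus a small top-N candidate set per (day, label)."""

    def __init__(self, top_n: int = 10):
        self.top_n = top_n
        self.sketches = defaultdict(CountMinSketch)
        self.candidates = defaultdict(dict)

    def observe(self, day: str, label: str, entity: str) -> None:
        key = (day, label)
        sketch = self.sketches[key]
        sketch.add(entity)
        top = self.candidates[key]
        top[entity] = sketch.estimate(entity)
        if len(top) > self.top_n:
            # Evict the weakest candidate; its count survives in the sketch.
            del top[min(top, key=top.get)]

    def top(self, day: str, label: str):
        items = self.candidates[(day, label)].items()
        return sorted(items, key=lambda kv: kv[1], reverse=True)
```

Calling `observe(day, label, entity)` once per recognized entity and then `top(day, label)` gives an approximate ranking per partition. Memory per partition stays bounded by the sketch size plus `top_n` candidates, at the cost of occasional overestimates from hash collisions.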

artembugara 1657 days ago

I'd really love to chat about that. Any chance to connect? email in bio

brd 1657 days ago

I really appreciate how accessible SpaCy has made NLP work but their NER is definitely low accuracy.

Where stem/lem felt critical to successful NLP processing a few years ago, we've found stem/lem work to be much less important for downstream tasks when transformer-based models are involved.

For topic extraction stem/lem still seems to do a lot to improve accuracy, and for rule-based approaches I can still see how it would facilitate more efficient processing at scale. I'd be curious to hear your experience fine-tuning and/or training new models after stem/lem processing with transformers; we've admittedly done little testing to see how transformers actually perform if properly tuned to post-processed data.

artembugara 1657 days ago

Did you try something like autoNLP by huggingface?

brd 1657 days ago

No, we've got our own fine tuning pipeline and initial tests showed better performance without traditional stem/lem processing so we dropped it from our classification pipelines and haven't seen a need to revisit.

robbedpeter 1657 days ago

Rule based processing can augment transformers by both filtering out bad input and by parsing good input into a form that plays to the strengths of a model.

You can do some fantastic things with BERT and spaCy, or gpt-neo/J/3, or combinations as needed. Expert systems and ontological tools and things like nltk, spaCy, and LinkGrammar are excellent complements to an AI workflow. Use the fast, "dumb" tools to do the fast, dumb tasks, and only use the huge smart models when you need them.

GPT-3 shouldn't be used if you're just doing tagging or NER, but you can get higher quality nuanced extrapolation or summarization if you run things through a mad libs style prompt generator that leans into prompts that work really well.
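A small sketch of that pattern, under assumptions: a cheap rule-based filter rejects unusable input, and a fill-in-the-blanks template builds the prompt for whatever large model sits downstream. The template wording, the `min_words` cutoff, and the function names are hypothetical; no model call is shown.

```python
import re
from string import Template
from typing import Optional

# Hypothetical "mad libs" template; the slots are filled by cheap upstream tools.
PROMPT = Template(
    "Summarize the following $kind about $topic in one sentence:\n\n$text\n\nSummary:"
)


def clean_input(text: str, min_words: int = 5) -> Optional[str]:
    """Rule-based filter: collapse whitespace and reject inputs that are too short."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    if len(collapsed.split()) < min_words:
        return None
    return collapsed


def build_prompt(text: str, kind: str, topic: str) -> Optional[str]:
    """Return a filled-in prompt, or None if the input should never reach the model."""
    cleaned = clean_input(text)
    if cleaned is None:
        return None
    return PROMPT.substitute(kind=kind, topic=topic, text=cleaned)
```

In practice the `kind` and `topic` slots could be filled from tagger or NER output, so the expensive model only sees input that the fast tools have already vetted and structured.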

kulikalov 1657 days ago

Are you using the high accuracy eng model for NER? I’ve been very happy with orgs recognition, it actually did way better than any other open source model in my case.

artembugara 1657 days ago

Try it on a sentence where all tokens are lower/upper case. It just doesn’t really work.

PeterisP 1656 days ago

Well, caseless text is a special scenario and not the default scenario. Case is a very strong signal for NER disambiguation, so if you want to support that, then you should apply a special model for that, because if the default model included support for caseless text, it would harm accuracy in the majority of scenarios where text actually is cased properly.

In essence, the current approaches are targeted for one domain of text over another. You can have a model that works reasonably in one scenario, or a model that works reasonably in another scenario, or a universal model that works poorly in all scenarios and thus is useless unless you really don't know what you're going to be analyzing.

You can support non-literary slang, but that comes at a cost for accuracy on literary languages. You can support multiple variants of language (e.g. for English - British, Indian, AAVE and non-AAVE American) but that comes at a cost of accuracy on any particular variant. You can support text ridden with typos, grammatical mistakes and chat-abbreviations, but that comes at a cost on correct text. The same applies for word casing. So for all of these things you try to support them if and only if you think you need them, since you don't have much of an "accuracy reserve" to sacrifice; the systems generally are barely sufficient for their use for your target domain, and they become not sufficient if you try to make them more general than you need to.

It would be nice if the default models explicitly listed their assumptions, though. Like, a model trained only on correct literary text of one language variant in proper case and not on anything else should clearly state that.
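One way to act on the "special model for caseless text" point is to route input by casing before it reaches any NER model. This is only a sketch: the `"cased"` and `"caseless"` keys are hypothetical names for two separately trained models, and the check covers just the all-lowercase and all-uppercase situations mentioned upthread.

```python
def pick_model(text: str) -> str:
    """Return the key of the (hypothetical) model that should handle this text.

    str.islower() / str.isupper() are True only when the text contains at least
    one cased character and every cased character has that case.
    """
    if text.islower() or text.isupper():
        return "caseless"
    return "cased"


examples = [
    "Apple is buying a startup in London",
    "apple is buying a startup in london",
    "APPLE IS BUYING A STARTUP IN LONDON",
]
routes = [pick_model(sentence) for sentence in examples]
```

Text with no letters at all falls through to the default cased model, since neither check matches.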

Xenoamorphous 1657 days ago

I don’t know how it compares with other paid alternatives (like Google’s or Amazon’s) but spaCy’s NER was pretty close to the (paid) service we were using (IBM) to the point we ditched IBM. Also for news articles.

But yeah disambiguation/entity linking would be nice.

artembugara 1657 days ago

I'd be happy to chat more if you want.

Eridrus 1657 days ago

I feel like NER is a poorly designed task in general. You're eventually trying to link the entities to some kind of KB, so you should be injecting that entity information into your system for detecting mentions.
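As a very simplified illustration of letting the KB drive mention detection, here is a dictionary (gazetteer) lookup that finds and links mentions in one pass. The alias table and the `kb:` identifiers are invented for the example, and real systems would need disambiguation on top; this is not a description of any particular library.

```python
import re
from typing import Dict, List, Tuple

# Invented alias table: lowercase surface form -> hypothetical KB identifier.
KB_ALIASES: Dict[str, str] = {
    "ibm": "kb:ibm",
    "international business machines": "kb:ibm",
    "new york": "kb:new_york",
}


def find_mentions(
    text: str, aliases: Dict[str, str], max_len: int = 4
) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of KB aliases over word tokens."""
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so multi-word aliases win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in aliases:
                mentions.append((" ".join(tokens[i:i + n]), aliases[candidate]))
                i += n
                break
        else:
            i += 1
    return mentions
```

Because matching is done on lowercased tokens, the lookup does not depend on casing, though it will miss any entity or alias that is absent from the table.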