by JimDabell 1300 days ago
Depending upon your use-case, you can get pretty good results by using spaCy for named entity recognition then matching on the titles of Wikipedia articles that have coördinates.

Agreed. That said, more often than not, as mentioned in the comment above (COVID use case), we'd look for a higher recall value in the predictions - there, NERs, although helpful, wouldn't be our go-to solution. This is exactly the reason why we open sourced the infrastructure and are rolling out the data
Tried this in the past, it's too limited... There are too many ways certain locations can be referred to. Take: New York City, NYC, NY, New York, NYCity, so on...
Wikipedia handles “New York City” and “NYC” as intended. “NY” and “New York” are ambiguous to both machines and humans (are you referring to the city or the state?) and if you have a resolution strategy for this then Wikipedia gives you the options to disambiguate. I’ve never seen “NYCity” used by anybody.
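To make the disambiguation point concrete, here is a rough sketch of the lookup step only (not the spaCy part). Everything in it is an assumption for illustration: `ARTICLES` and `ALIASES` are tiny hand-written stand-ins for a table built from Wikipedia titles, redirects, and disambiguation pages, the coordinates are approximate, and `candidates`, `resolve`, and `pick` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Coord = Tuple[float, float]

# Hypothetical stand-in for articles that have coordinates (approximate values).
ARTICLES = {
    "New York City": (40.71, -74.01),
    "New York (state)": (43.0, -75.0),
}

# Hypothetical alias table: a redirect maps to one title,
# an ambiguous name maps to several candidate titles.
ALIASES = {
    "new york city": ["New York City"],
    "nyc": ["New York City"],
    "new york": ["New York City", "New York (state)"],
    "ny": ["New York City", "New York (state)"],
}


def candidates(mention: str) -> List[str]:
    """Return every article title the mention could refer to."""
    key = " ".join(mention.lower().split())  # lowercase, collapse whitespace
    return ALIASES.get(key, [])


def resolve(
    mention: str,
    pick: Optional[Callable[[List[str]], str]] = None,
) -> Optional[Tuple[str, Coord]]:
    """Resolve a mention to (title, coordinates), or None if unknown.

    Ambiguous mentions are resolved only when the caller supplies a
    strategy via `pick`; otherwise they are left unresolved.
    """
    titles = candidates(mention)
    if not titles:
        return None
    if len(titles) > 1:
        if pick is None:
            return None
        title = pick(titles)
    else:
        title = titles[0]
    return title, ARTICLES[title]
```

With this, `resolve("NYC")` gives the city, `resolve("NY")` stays unresolved until you pass a strategy such as `pick=lambda titles: titles[0]`, and something like "NYCity" simply falls through as a miss.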
If you start processing web articles on the scale of millions you'll be surprised by how creative people can be. Not talking about tweets, just news and blog articles.
Not surprised, just not relevant. The criterion here is “you can get pretty good results”, not “you must be able to process millions of articles without failure”.
If a method is not generalizable to the entire dataset, it's not that useful.

When processing text at large scale, the usefulness of heuristic approaches like the one we're discussing diminishes rapidly.

> If a method is not generalizable to the entire dataset, it's not that useful.

No, in many situations, something doesn’t have to be perfect to be useful.

Again, I think you are missing the original point being made:

> Depending upon your use-case, you can get pretty good results by…

You seem to be responding as if I said:

> For all use-cases, you can get flawless results by…

Pointing out that this is not perfect is irrelevant to the point I was making. “Good enough” is usually good enough.