| HN Mirror



	by JimDabell 1348 days ago
	Depending upon your use-case, you can get pretty good results by using spaCy for named entity recognition then matching on the titles of Wikipedia articles that have coördinates.
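A minimal sketch of the approach described above, under stated assumptions. `GAZETTEER` is a hypothetical hand-written stand-in for a table of Wikipedia article titles that carry coordinates (the values are rough and for illustration only), and `normalize`, `build_index`, `match_entities`, and `spacy_place_entities` are illustrative names, not part of spaCy or Wikipedia. The spaCy step assumes the `en_core_web_sm` model is installed and is imported lazily so the matching logic can run without it.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float]

# Hypothetical gazetteer: article title -> approximate (lat, lon).
# A real one would be built from Wikipedia data; this is an assumption.
GAZETTEER: Dict[str, Coord] = {
    "New York City": (40.71, -74.01),
    "Paris": (48.86, 2.35),
}


def normalize(name: str) -> str:
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(name.split()).casefold()


def build_index(gazetteer: Dict[str, Coord]) -> Dict[str, Tuple[str, Coord]]:
    """Map each normalized title to its canonical title and coordinates."""
    return {normalize(title): (title, coord) for title, coord in gazetteer.items()}


def match_entities(
    entities: Iterable[str], index: Dict[str, Tuple[str, Coord]]
) -> List[Tuple[str, str, Coord]]:
    """Return (entity text, matched title, coordinates) for each entity found."""
    matches = []
    for entity in entities:
        hit = index.get(normalize(entity))
        if hit is not None:
            title, coord = hit
            matches.append((entity, title, coord))
    return matches


def spacy_place_entities(text: str, model: str = "en_core_web_sm") -> List[str]:
    """Extract place-like entity strings with spaCy (model assumed installed)."""
    import spacy  # imported lazily; only needed for this step

    nlp = spacy.load(model)
    # GPE, LOC and FAC are the place-like labels in spaCy's English models.
    return [ent.text for ent in nlp(text).ents if ent.label_ in {"GPE", "LOC", "FAC"}]
```

With the model installed, the two steps chain as `match_entities(spacy_place_entities(text), build_index(GAZETTEER))`. Entities with no matching title are simply dropped, which is where the recall concerns raised in the replies come in.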


yachayai 1341 days ago

Agreed. That said, more often than not, as mentioned in the comment above (COVID use case), we'd look for a higher recall value in the predictions - there, NERs, although helpful, wouldn't be our go-to solution. This is exactly the reason why we open sourced the infrastructure and are rolling out the data


rmbyrro 1348 days ago

Tried this in the past, it's too limited... There are too many ways certain locations can be referred to. Take: New York City, NYC, NY, New York, NYCity, so on...


JimDabell 1348 days ago

Wikipedia handles “New York City” and “NYC” as intended. “NY” and “New York” are ambiguous to both machines and humans (are you referring to the city or the state?) and if you have a resolution strategy for this then Wikipedia gives you the options to disambiguate. I’ve never seen “NYCity” used by anybody.
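To make the distinction concrete, here is a rough sketch of how aliases and ambiguous names could be handled. The tables below are hypothetical, hand-written stand-ins for Wikipedia's redirects and disambiguation pages, not data pulled from Wikipedia, and `candidates` and `resolve` are illustrative names. The `prefer` callback stands in for whatever resolution strategy the caller has.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical stand-ins for Wikipedia data (assumptions for illustration).
TITLES: Set[str] = {"New York City", "New York (state)"}
REDIRECTS: Dict[str, str] = {"NYC": "New York City"}
DISAMBIGUATION: Dict[str, List[str]] = {
    "New York": ["New York City", "New York (state)"],
    "NY": ["New York City", "New York (state)"],
}


def candidates(name: str) -> List[str]:
    """Return every article title the name could refer to."""
    if name in DISAMBIGUATION:
        # Ambiguous name: hand back all options instead of guessing.
        return list(DISAMBIGUATION[name])
    # Follow a redirect if there is one, else treat the name as a title.
    target = REDIRECTS.get(name, name)
    return [target] if target in TITLES else []


def resolve(
    name: str, prefer: Optional[Callable[[str], bool]] = None
) -> Optional[str]:
    """Pick one title, using the caller's strategy only when it is needed."""
    options = candidates(name)
    if len(options) == 1:
        return options[0]
    if prefer is not None:
        for option in options:
            if prefer(option):
                return option
    # Unknown name, or ambiguous with no usable strategy.
    return None
```

Here `resolve("NYC")` follows the redirect, `resolve("NY")` returns `None` unless a strategy is supplied, and an unseen spelling such as "NYCity" yields no candidates at all.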


rmbyrro 1348 days ago

If you start processing web articles on the scale of millions you'll be surprised by how creative people can be. Not talking about tweets, just news and blog articles.


JimDabell 1347 days ago

Not surprised, just not relevant. The criterion here is “you can get pretty good results”, not “you must be able to process millions of articles without failure”.


rmbyrro 1347 days ago

If a method is not generalizable to the entire dataset, it's not that useful.

When processing text at large scale, the usefulness of heuristic approaches like the one we're discussing diminishes rapidly.


JimDabell 1346 days ago

> If a method is not generalizable to the entire dataset, it's not that useful.

No, in many situations, something doesn’t have to be perfect to be useful.

Again, I think you are missing the original point being made:

> Depending upon your use-case, you can get pretty good results by…

You seem to be responding as if I said:

> For all use-cases, you can get flawless results by…

Pointing out that this is not perfect is irrelevant to the point I was making. “Good enough” is usually good enough.
