
by phorese 4573 days ago

The time regex is, too. In German you can expect to encounter the text fragment

> um 6:00 am 05.12. (at 6:00 on 12/05/..)

If I read it correctly, the time regex would extract "6:00 am" as the time, but the "am" is wrong: here it is the German preposition that belongs to the date ("am 05.12." = "on 05.12."), and German uses the 24h format.
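To make the misparse concrete, here is a minimal sketch. The pattern below is a hypothetical stand-in, not the module's actual regex; it only assumes the common shape "hours:minutes, optionally followed by am/pm".

```python
import re

# Hypothetical pattern (an assumption, not the module's real regex):
# H:MM or HH:MM with an optional am/pm suffix.
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2})(?:\s*(am|pm))?\b", re.IGNORECASE)

text = "um 6:00 am 05.12."
match = TIME_RE.search(text)

# The optional suffix greedily swallows the German preposition "am".
print(match.group(0))  # 6:00 am
print(match.group(2))  # am
```

A pattern like this has no way to tell the English meridiem from the German preposition, which is why it only works when the input language is known in advance.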

3 comments

jrabone 4573 days ago

Haha that's excellent. Reminds me of a normalisation rule I wrote as part of a larger system to convert "Joe Bloggs Md." into "Dr. Joe Bloggs MD" (where MD is Medical Doctor). TIL that "Md." is a common abbreviation for "Mohammed" in large parts of the world...


petepete 4573 days ago

The text-to-speech system in use at my local GP's surgery (that announces to patients which rooms they need to go to) pronounces 'Dr' as 'Drive', rather than 'Doctor'. I thought someone would have tested that!


abolibibelot 4573 days ago

When dealing with this general problem, the proper tool would more likely be language/culture detection + Named-Entity Recognition.

Simple regular expressions can be good enough if you're aware of the domain restriction though.


madisonmay 4573 days ago

Yeah, I threw together this module intending to supplement NER on a text classification project I'm currently working on, not as a replacement for NER.


sokoloff 4573 days ago

"Um 12:30 am..." seems like a better example, since 6:00 is still 0600, but one parsing would make 12:30 am into 0030 rather than the intended 1230.
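A rough sketch of why 12:30 is the worse case: the usual 12-hour to 24-hour conversion leaves 6:00 am alone but moves 12:30 am to just after midnight. The helper `to_24h` is a hypothetical name used for illustration only.

```python
def to_24h(hour, minute, meridiem=None):
    """Convert a clock reading to (hour, minute) on a 24-hour clock.

    `meridiem` is "am", "pm", or None when no suffix was extracted.
    Hypothetical helper, written only to illustrate the thread's point.
    """
    if meridiem is None:
        return hour, minute           # already 24-hour, as in German text
    if meridiem.lower() == "am":
        return hour % 12, minute      # 12 am maps to hour 0
    return hour % 12 + 12, minute     # 12 pm stays 12, 1 pm becomes 13


print(to_24h(6, 0, "am"))    # (6, 0)   wrong suffix, same result
print(to_24h(12, 30, "am"))  # (0, 30)  wrong suffix, wrong time
print(to_24h(12, 30))        # (12, 30) the intended reading
```

So the spurious "am" is harmless for most hours and only changes the result when the hour is 12.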


phorese 4573 days ago

Heh, thanks. I only looked at misparsing and didn't think about the consequences :D
