Hacker News
by phorese 4573 days ago
The time regex is, too. In German you can expect to encounter the text fragment

> um 6:00 am 05.12. (at 6:00 on 12/05/..)

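If I read it correctly, the time regex would extract "6:00 am" as the time, but the "am" is wrong: here it is the German preposition belonging to the date ("am 05.12." means "on 05.12."), and German uses the 24h format.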
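
A minimal sketch of what I mean. The pattern below is a hypothetical stand-in that I made up for illustration (`NAIVE_TIME` is not the module's actual regex), but any pattern with an optional am/pm suffix should behave the same way on this fragment:

```python
import re

# Hypothetical naive pattern: H:MM or HH:MM with an optional am/pm suffix.
# This is an assumption for illustration, not the regex from the module.
NAIVE_TIME = re.compile(r"\b\d{1,2}:\d{2}(?:\s*[ap]m)?\b", re.IGNORECASE)

text = "um 6:00 am 05.12."
match = NAIVE_TIME.search(text)

# The German preposition "am" gets swallowed as an AM marker.
extracted = match.group(0)
print(extracted)
```

The match includes the trailing "am", even though that word belongs to the date that follows.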

3 comments

Haha that's excellent. Reminds me of a normalisation rule I wrote as part of a larger system to convert "Joe Bloggs Md." into "Dr. Joe Bloggs MD" (where MD is Medical Doctor). TIL that "Md." is a common abbreviation for "Mohammed" in large parts of the world...
The text-to-speech system in use at my local GP's surgery (the one that announces to patients which rooms they need to go to) pronounces 'Dr' as 'Drive', rather than 'Doctor'. I thought someone would have tested that!
When dealing with this general problem, the proper tool would more likely be language/culture detection + Named-Entity Recognition.

Simple regular expressions can be good enough if you're aware of the domain restriction though.

Yeah, I threw together this module intending to supplement NER on a text classification project I'm currently working on, not as a replacement for NER.
"Um 12:30 am..." seems like a better example, since 6:00 is still 0600, but one parsing would make 12:30 am into 0030 rather than the intended 1230.
Heh, thanks. I only looked at misparsing and didn't think about the consequences :D
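
To make the 12:30 case concrete, here is a rough sketch of the two readings. Both the pattern and the `to_24h` helper are hypothetical names I made up for this comment, not anything from the module; `honor_meridiem=True` mimics a parser that treats "am" as an AM marker, and `False` mimics one that ignores it, as German text would require:

```python
import re

# Hypothetical pattern and helper, for illustration only.
TIME = re.compile(r"(\d{1,2}):(\d{2})(?:\s*([ap]m))?", re.IGNORECASE)

def to_24h(text, honor_meridiem):
    """Return the first time found in text as an HHMM string, or None."""
    m = TIME.search(text)
    if m is None:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    meridiem = (m.group(3) or "").lower()
    if honor_meridiem and meridiem:
        # 12-hour clock rule: 12 am is hour 0, 12 pm is hour 12.
        hour = hour % 12 + (12 if meridiem == "pm" else 0)
    return f"{hour:02d}{minute:02d}"

print(to_24h("Um 12:30 am 05.12.", honor_meridiem=True))   # English reading
print(to_24h("Um 12:30 am 05.12.", honor_meridiem=False))  # German reading
print(to_24h("um 6:00 am 05.12.", honor_meridiem=True))    # same either way
```

With 6:00 the two readings agree, so the misparse is harmless there; with 12:30 they are twelve hours apart.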