by abolibibelot 4574 days ago
I'm gonna be That Other Guy and say that the date and phone regexes are English-language and US-specific, respectively. So it's common for a narrow definition of common.
4 comments

Yep. This library is the very application and example of H. L. Mencken's:

> There is always a well-known solution to every human problem — neat, plausible, and wrong.

often paraphrased as "For every complex problem there is an answer that is clear, simple, and wrong."

The time regex is, too. In German you can expect to encounter the text fragment

> um 6:00 am 05.12. (at 6:00 on 12/05/..)

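If I read it correctly, the time regex would extract "6:00 am" as the time, but the "am" is wrong: here it is the German preposition that belongs to the date ("am 05.12." means "on 05.12."), and German uses the 24h format.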
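A quick sketch of that misparse. The pattern below is a hypothetical, typical 12-hour time regex written for illustration; it is an assumption, not the module's actual regex.

```python
import re

# Hypothetical 12-hour time pattern (illustrative only, NOT the module's real regex):
# hours, minutes, an optional space, then an optional am/pm marker.
TIME_RE = re.compile(r"\d{1,2}:[0-5]\d ?(?:[ap]\.?m\.?)?", re.IGNORECASE)

german_text = "um 6:00 am 05.12."

match = TIME_RE.search(german_text)
extracted = match.group(0)

# The German preposition "am" gets swallowed as if it were a meridiem marker.
print(extracted)  # 6:00 am
```

The pattern has no way to know that "am" belongs to the date that follows, which is the kind of thing only language detection can settle.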

Haha that's excellent. Reminds me of a normalisation rule I wrote as part of a larger system to convert "Joe Bloggs Md." into "Dr. Joe Bloggs MD" (where MD is Medical Doctor). TIL that "Md." is a common abbreviation for "Mohammed" in large parts of the world...
The text-to-speech system in use at my local GP's surgery (that announces to patients which rooms they need to go to) pronounces 'Dr' as 'Drive', rather than 'Doctor'. I thought someone would have tested that!
When dealing with this general problem, the proper tool would more likely be language/culture detection + Named-Entity Recognition.

Simple regular expressions can be good enough if you're aware of the domain restriction though.
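One way to make the domain restriction explicit is to key the patterns by locale, so the caller has to say which convention applies. This is only a sketch: the locale keys, the patterns and the `find_dates` helper are all hypothetical, and the locale itself would have to come from somewhere else (language detection, user settings).

```python
import re

# Hypothetical per-locale date patterns; keys and regexes are illustrative only.
DATE_PATTERNS = {
    "en_US": re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),    # month/day/year
    "de_DE": re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b"),  # day.month.year
}

def find_dates(text, locale):
    """Return (year, month, day) tuples using only the pattern for `locale`."""
    results = []
    for m in DATE_PATTERNS[locale].finditer(text):
        first, second, year = (int(g) for g in m.groups())
        # US dates put the month first, German dates put the day first.
        month, day = (first, second) if locale == "en_US" else (second, first)
        results.append((year, month, day))
    return results

print(find_dates("on 12/05/2013", "en_US"))  # [(2013, 12, 5)]
print(find_dates("am 05.12.2013", "de_DE"))  # [(2013, 12, 5)]
```

No range validation is done here (month 13 would pass), so it is a starting point rather than a parser.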

Yeah, I threw together this module intending to supplement NER on a text classification project I'm currently working on, not as a replacement for NER.
"Um 12:30 am..." seems like a better example, since 6:00 is still 0600, but one parsing would make 12:30 am into 0030 rather than the intended 1230.
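To spell out the arithmetic, here is a minimal 12-hour to 24-hour conversion (`to_24h` is a hypothetical helper, not part of the module):

```python
def to_24h(hour, minute, meridiem):
    """Convert a 12-hour clock reading to a 4-digit 24-hour string."""
    hour = hour % 12          # 12 am -> 0, 12 pm -> 0 (before the pm offset)
    if meridiem.lower() == "pm":
        hour += 12
    return f"{hour:02d}{minute:02d}"

# "6:00 am" happens to survive the misparse...
print(to_24h(6, 0, "am"))    # 0600
# ...but "12:30 am" turns midday into half past midnight.
print(to_24h(12, 30, "am"))  # 0030
```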
Heh, thanks. I only looked at misparsing and didn't think about the consequences :D
I'm gonna be That Other Other Guy and say... stuff like this is the reason why most programmers I meet take ages to do anything custom. Everybody uses this framework and that library and makes bulky code that could actually be implemented with 2 far more efficient lines, and then struggles when the need to customize presents itself. Learn regex... you will have a crazy powerful weapon in your arsenal.
Why is this being downvoted?

This is an excellent point. Petty substitutions like verbal expressions might be useful if you're just getting started, but ultimately it's a crutch and it's best just to learn pure regular expressions. They're not that difficult.

Same with bundling a ton of dependencies. Lots of people (especially contemporary programmers, primarily web developers) seem to be deathly afraid of writing custom code to handle a job. It's not "reinventing the wheel", it's implementing logic easily extensible within your application without the hassle of upstream, especially if you're only using a small portion of a library or framework. Using 15 libraries for a 600-line script isn't best practice, it's cowardice.

There are pros and cons to both approaches. If I see a junior developer trying to reinvent the wheel, he's probably going to build a pretty shitty wheel. The whole point of using dependencies isn't laziness or cowardice, it's leveraging others' work to save time, energy, and potential headaches caused by subpar custom implementations. Of course, that doesn't absolve us from learning the guts of our dependencies, but chastising developers for avoiding unnecessary work is absurd. The key is in learning when to import and when to DIY.
It really shouldn't have been downvoted.
The author does explicitly note that in the readme:

> Please note that this module is currently English/US specific.