Hacker News new | ask | show | jobs
by modalduality 3229 days ago
Good article (didn't realize there were other kinds of lookaround), but maybe the bottom should link to well-tested standards-based regexes instead.

    URL: ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
I recently encountered a case where a URL had an underscore at the end of a subdomain name. It seems underscores are okay anywhere else, but while my friend on Windows was able to load the website, I wasn't (on Linux) using Firefox, curl, remote screenshot service which presumably ran Linux etc. According to various RFCs, they should be okay anywhere within the subdomain name.

Has anyone encountered this behavior? Couldn't find anything on the internet; maybe it's just my computer?

4 comments

It seems to mostly come down to differences in how things are defined. DNS itself can handle almost arbitrary data https://tools.ietf.org/html/rfc2181#section-11, while an Internet Hostname was defined to be more strict https://tools.ietf.org/html/rfc1123#section-2. The same issue also exists with dashes at the end of domain components.

I'm not enough of a history boffin to know how Microsoft came to support it differently (perhaps something from the Netbios and NT era). At this point in time though, I don't see either party changing their default validations to agree on a single definition.

Edit: If you're curious, this is the first commit that appears to be the first glibc commit limiting dashes at the end of URLS https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fa0bc.... I don't know about BSD libc, or windows however.

Wait a second...does this imply that if I put downloads that should only be of interest to our Windows customers on a server named something like downloads_.ourdomain.com, it might keep out all those annoying bots that ignore robots.txt and make a lot of noise in my logs? I'm guessing that most of the bots are not running on Windows.
That's a pretty bad idea, you shouldn't rely on this kind of stuff.

If there are people running OSX or Linux that want Windows downloads, or someone is behind a captive portal or proxy (like squid), they probably won't be able to reach it anymore.

If you have a real problem with bots, I'd look at what IPs they are coming from, and how often they try to connect. Something like IP blacklisting, or fail2ban might work for your use case.

Wow, how did you find that commit?
Both git and this git web view allow you to view all the commits that have modified just that file. Eg. https://sourceware.org/git/?p=glibc.git;a=history;f=resolv/r.... So it's a simple matter of looking at the diffs between commits.

Of course that's assuming you know the right file, which is often the harder problem.

Yes, I feel that the "Bonus" section (with no explanation even) is rather encouraging beginners to mis-use regular expressions in general, and - more specifically - contains errors.
I personally avoid regexes where possible, including in this situation. IMO the right way to validate a URL is to feed it to a URL parser and see if it errors out. I can see errors in this regex right away - and in many other regexes you find from Googling. People just drop them into their codebase and their eyes glaze over when you ask them whether or not it's actually correct. How many websites fail on user+whatever@gmail.com because they copied a bad regex?
hmm interesting, do you still have the domain/url? You could search for it in your history using regex :)