| HN Mirror

by modalduality 3276 days ago

Good article (didn't realize there were other kinds of lookaround), but maybe the bottom should link to well-tested standards-based regexes instead.

    URL: ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$

I recently encountered a case where a URL had an underscore at the end of a subdomain name. Underscores seem to be fine anywhere else in the name, but while my friend on Windows was able to load the website, I wasn't on Linux: not with Firefox, not with curl, and not with a remote screenshot service that presumably also ran Linux. According to various RFCs, underscores should be okay anywhere within the subdomain name.

Has anyone encountered this behavior? Couldn't find anything on the internet; maybe it's just my computer?

4 comments

keeperofdakeys 3276 days ago

It seems to mostly come down to differences in how things are defined. DNS itself can handle almost arbitrary data (https://tools.ietf.org/html/rfc2181#section-11), while an Internet hostname was defined to be more strict (https://tools.ietf.org/html/rfc1123#section-2). The same issue also exists with dashes at the end of domain components.
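To make the stricter hostname rule concrete, here is a minimal Python sketch of a per-label check. The function name `is_strict_hostname` is made up for illustration, and the rule it encodes (letters, digits, and hyphens only, no hyphen at either end of a label) is my reading of the hostname definition, not a copy of what glibc or any browser actually runs.

```python
import re

# One label under the strict "hostname" rule: 1-63 characters drawn from
# letters, digits, and hyphens, with no hyphen at the start or the end.
# Underscores are not in the allowed set at all.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_strict_hostname(name: str) -> bool:
    """Return True if every dot-separated label passes the strict rule."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing root dot
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("downloads.ourdomain.com", "downloads_.ourdomain.com"):
        print(candidate, is_strict_hostname(candidate))
```

A resolver that applies a check like this rejects the name before any DNS query is sent, which would explain why the same name can work on one platform and fail on another.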

I'm not enough of a history boffin to know how Microsoft came to support it differently (perhaps something from the NetBIOS and NT era). At this point in time though, I don't see either party changing their default validations to agree on a single definition.

Edit: If you're curious, this appears to be the first glibc commit limiting dashes at the end of URLs https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fa0bc.... I don't know about BSD libc or Windows, however.

tzs 3276 days ago

Wait a second...does this imply that if I put downloads that should only be of interest to our Windows customers on a server named something like downloads_.ourdomain.com, it might keep out all those annoying bots that ignore robots.txt and make a lot of noise in my logs? I'm guessing that most of the bots are not running on Windows.

keeperofdakeys 3276 days ago

That's a pretty bad idea; you shouldn't rely on this kind of stuff.

If there are people running OS X or Linux who want the Windows downloads, or someone is behind a captive portal or proxy (like Squid), they probably won't be able to reach it anymore.

If you have a real problem with bots, I'd look at what IPs they are coming from and how often they try to connect. Something like IP blacklisting or fail2ban might work for your use case.
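As a rough sketch of the "which IPs, how often" step, something like the following Python would do. It assumes each access-log line starts with the client IP followed by whitespace (as in the common log format); the helper name `top_clients` is hypothetical.

```python
from collections import Counter


def top_clients(log_lines, n=10):
    """Count requests per client, assuming the IP is the first field of each line."""
    counts = Counter(
        line.split(maxsplit=1)[0]  # first whitespace-separated field
        for line in log_lines
        if line.strip()            # skip blank lines
    )
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [01/Jan/2017:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
        '203.0.113.7 - - [01/Jan/2017:00:00:02 +0000] "GET /a HTTP/1.1" 404 0',
        '198.51.100.2 - - [01/Jan/2017:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
    ]
    print(top_clients(sample, 2))
```

The noisiest addresses from a report like this are the candidates for a blacklist or a fail2ban rule.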

modalduality 3276 days ago

Wow, how did you find that commit?

keeperofdakeys 3276 days ago

Both git and this git web view let you list all the commits that have modified a particular file, e.g. https://sourceware.org/git/?p=glibc.git;a=history;f=resolv/r.... So it's a simple matter of looking at the diffs between commits.

Of course that's assuming you know the right file, which is often the harder problem.

sambe 3276 days ago

Yes, I feel that the "Bonus" section (which doesn't even come with an explanation) rather encourages beginners to misuse regular expressions in general and, more specifically, contains errors.

Sir_Cmpwn 3276 days ago

I personally avoid regexes where possible, including in this situation. IMO the right way to validate a URL is to feed it to a URL parser and see if it errors out. I can see errors in this regex right away - and in many other regexes you find from Googling. People just drop them into their codebase and their eyes glaze over when you ask them whether or not it's actually correct. How many websites fail on user+whatever@gmail.com because they copied a bad regex?
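A minimal sketch of the parser approach in Python, using the standard library's `urllib.parse.urlsplit`. The function name `looks_like_http_url` and the choice to accept only http/https are assumptions for illustration. Note that `urlsplit` is fairly lenient and only raises on some malformed inputs, so the sketch also checks that a scheme and host were actually found.

```python
from urllib.parse import urlsplit


def looks_like_http_url(text: str) -> bool:
    """Parse instead of pattern-matching; reject anything the parser chokes on."""
    try:
        parts = urlsplit(text)
        parts.port  # accessing .port raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    # A lenient parser accepts almost any string, so require the pieces we need.
    return parts.scheme in ("http", "https") and bool(parts.hostname)


if __name__ == "__main__":
    for candidate in ("https://example.com/a?b=c", "http://example.com:99999/", "not a url"):
        print(candidate, looks_like_http_url(candidate))
```

This doesn't prove the URL is reachable or that the hostname is valid; it only delegates the syntax question to code that was written and tested for that purpose.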

thinkMOAR 3276 days ago

Hmm, interesting. Do you still have the domain/URL? You could search for it in your history using regex :)

junke 3276 days ago

https://xkcd.com/1313/
