by likium 1723 days ago
Even if you build a URL validation regex that follows RFC 3986[1] and RFC 3987[2], you will still get bug reports from users, because web browsers follow a different standard.

For example, <http://example.com./>, <http:///example.com/> and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog, but browsers accept them.

As the creator of cURL puts it, there is no URL standard[3].

[1]: https://www.ietf.org/rfc/rfc3986.txt

[2]: https://www.ietf.org/rfc/rfc3987.txt

[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/

3 comments

Tangentially, YouTube had a bug surface last year where adding that extra dot let you avoid all ads. Previous discussion: [1]

[1] https://news.ycombinator.com/item?id=23479435

This "bug" could definitely also be called a feature ;-)
Same goes for nearly every paywalled media site.
There might not have been a generally accepted standard then, but there is now: https://url.spec.whatwg.org/
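As a rough illustration of what that spec's parser does with the three examples from the top comment, here is a minimal sketch using the `URL` constructor that browsers, Node.js and Deno ship, which implements the WHATWG algorithm. `parseLikeBrowser` is a made-up helper name for this sketch, not part of any library.

```typescript
// Sketch only: `parseLikeBrowser` is a hypothetical helper, not a library API.
// It wraps the built-in WHATWG `URL` constructor, which throws a TypeError
// when the input cannot be parsed as an absolute URL.
function parseLikeBrowser(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

// The three strings quoted in the top comment.
const samples: string[] = [
  "http://example.com./",
  "http:///example.com/",
  "https://en.wikipedia.org/wiki/Space (punctuation)",
];

for (const sample of samples) {
  const parsed = parseLikeBrowser(sample);
  // Print the normalized form the parser settled on, or "rejected".
  console.log(sample, "=>", parsed ? parsed.href : "rejected");
}
```

All three parse. The parser keeps the trailing dot in the host, drops the extra slash, and percent-encodes the space in the path, so "accepted by the browser" often means "silently normalized" rather than "valid as written".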
There's also a question of what we're really trying to validate, IMHO. All of these regex patterns will tell you that a string looks like a URL, but they won't actually tell you whether there's any web server listening at that particular URL, whether that server has the resource at that location, whether that server is reachable from where you want to fetch it, etc.
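To make the difference concrete, here is a minimal sketch that separates "parses as a URL" from "something answered at that URL". Everything named here (`Verdict`, `verdictFromStatus`, `checkUrl`, the 5 second default timeout) is an assumption made for illustration. It relies on the global `fetch` available in browsers and recent Node.js, and it only reports reachability from the machine that runs it.

```typescript
// Hypothetical names throughout; this is a sketch, not a definitive check.
type Verdict = "malformed" | "unreachable" | "bad_status" | "ok";

// Pure helper: map an HTTP status (or null when no response arrived) to a verdict.
function verdictFromStatus(status: number | null): Verdict {
  if (status === null) return "unreachable";
  return status >= 200 && status < 300 ? "ok" : "bad_status";
}

async function checkUrl(input: string, timeoutMs: number = 5000): Promise<Verdict> {
  // Step 1: syntax only. This is all a regex or a parser can tell you.
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return "malformed";
  }

  // Step 2: ask the network. A HEAD request avoids downloading the body,
  // though some servers answer HEAD differently from GET.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url.href, { method: "HEAD", signal: controller.signal });
    return verdictFromStatus(res.status);
  } catch {
    // DNS failure, refused connection, timeout, unsupported scheme, and so on.
    return "unreachable";
  } finally {
    clearTimeout(timer);
  }
}
```

Even an "ok" here only holds for that moment and that network vantage point, which is a different guarantee from anything a pattern match can give.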
> All of these regex patterns will tell you that a string looks like a URL,

Yeah, that's it. That's what they're trying to validate.

It seems like the answer is almost always yes.