by jwilk 2984 days ago
This regexp lets you parse a valid URI, but it also matches a lot of things that are not URIs at all.

The URI language is of course regular, so it would be possible to construct a regexp that matches only URIs. But naively applying such a regexp to find URIs in text wouldn't work in practice, because many punctuation characters are allowed in URIs. For example, single quotes are allowed, so in this Python code the regexp would match too much:

   homepage = 'http://example.com/'
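To make that concrete, here is a rough sketch. The character class is only a simplified approximation of the characters RFC 3986 permits (unreserved, gen-delims, sub-delims and "%"), not a full URI grammar, and the names URI_CHARS and URI_RE are made up for this example:

```python
import re

# Simplified approximation of the characters RFC 3986 allows in a URI.
# Note that the sub-delims set includes the single quote.
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# A scheme, a colon, then any run of allowed characters.
URI_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:[" + URI_CHARS + r"]*")

line = "homepage = 'http://example.com/'"
match = URI_RE.search(line)

# The closing quote is a legal URI character, so the match swallows it.
print(match.group(0))  # prints: http://example.com/'
```

The match is itself a syntactically valid URI, which is the problem: the regexp has no way of knowing that the trailing quote belongs to the Python string literal and not to the URI.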
1 comment

I see. The standard allows most of the characters we normally use to surround URIs. It sure does look like a difficult problem then, and one that a regexp alone can't solve.