by oneeyedpigeon 2984 days ago
I guess one of the bigger challenges when it comes to unstructured data is identifying URLs. Is there a canonical way of identifying a URL embedded in text? Is it an impossible problem?

Perhaps I’m misunderstanding the question, but there is a regular expression for matching a URI. Remove the leading caret and it can match anywhere in a text.

https://tools.ietf.org/html/rfc3986#appendix-B

Edit: I see it's not quite that simple. However, I still think that with some stricter matching requirements this could work.
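For anyone who wants to try it, here is a minimal Python sketch of the Appendix B expression, with the leading caret kept. The group numbers are the ones the RFC assigns to each component; the helper name `split_uri` is just made up for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B (leading caret kept).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(text):
    """Split text into the five components named in Appendix B.

    split_uri is a hypothetical helper name, not part of any library.
    """
    m = URI_RE.match(text)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# The example URI used in the RFC itself.
parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
```

Note that this splits a string already known to be a URI into parts; it does not decide whether the string is a URI in the first place.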

This regexp lets you parse a valid URI, but it also matches a lot of things that are not URIs at all.
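A quick way to see this, sketched in Python: every component in the Appendix B expression is optional, so it accepts arbitrary text, including the empty string. The sample strings and the `accepts` helper are made up for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def accepts(text):
    """Return True if the whole string matches the Appendix B expression."""
    return URI_RE.fullmatch(text) is not None

# Hypothetical sample inputs, none of which is a URI anyone would link to.
samples = ["", "hello world", "not a URI at all!", "note: remember milk"]
results = {s: accepts(s) for s in samples}
print(results)
```

In the last sample, `note` is even reported as the scheme, because anything before the first colon qualifies.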

The URI language is of course regular, so it would be possible to construct a regexp that matches only URIs. But naively applying such a regexp wouldn't work in practice, because many punctuation characters are allowed in URIs. For example, single quotes are allowed, so in this Python code the regexp would match too much:

   homepage = 'http://example.com/'
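To make the over-matching concrete, here is a rough sketch. The scanner below is an assumption for illustration, not a full RFC 3986 grammar: it accepts `http://` or `https://` followed by any run of characters the RFC allows in a URI. The trailing-punctuation trim is one common heuristic, not something the standard defines.

```python
import re

# Hypothetical scanner: scheme prefix plus any run of characters that
# RFC 3986 permits in a URI (unreserved, reserved, and "%").
URL_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"
URL_SCAN_RE = re.compile(r"https?://[" + URL_CHARS + r"]+")

line = "homepage = 'http://example.com/'"

# The closing quote is a legal URI character, so it is swallowed.
naive = URL_SCAN_RE.findall(line)

# Heuristic (an assumption, not part of the RFC): strip punctuation that
# usually belongs to the surrounding text rather than to the URL.
TRAILING = "'\".,;:!?)]"
trimmed = [url.rstrip(TRAILING) for url in naive]

print(naive)
print(trimmed)
```

The trim recovers the intended URL here, but it will also cut URLs that legitimately end in one of those characters, such as a path ending in a closing parenthesis, so it trades one kind of error for another.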
I see. The standard allows most of the characters we normally use to surround URIs. It sure does look like a difficult problem then, and one that a regexp can't solve.